<p>https://dvc.org/blog/dvc-ray-part-2 (published Wed, 13 Mar 2024)</p> <p>In <a href="https://dvc.ai/blog/dvc-ray" target="_blank" rel="nofollow noopener noreferrer">Part 1</a> of the tutorial, we explored the basics of setting up and integrating DVC with Ray for distributed machine learning workflows. By leveraging Ray's distributed computing capabilities and DVC's data version control, we establish a robust framework for managing complex ML experiments. This combination allows for enhanced scalability, reproducibility, and collaboration in ML projects.</p> <p>In Part 2, we extend the solution to a Ray Cluster on AWS, demonstrating how to adapt the setup for cloud-based distributed computing. This part involves configuring AWS resources, deploying Ray clusters in the cloud, and running DVC-managed pipelines at scale.</p> <blockquote> <p>We would like to express our gratitude to <a href="https://www.linkedin.com/in/schuh/" target="_blank" rel="nofollow noopener noreferrer">Andreas Schuh</a> from <a href="https://www.heartflow.com/" target="_blank" rel="nofollow noopener noreferrer">HeartFlow</a> for his contribution to this solution and for providing ideas and feedback for the blog posts. 
🤝</p> </blockquote> <details> <summary>Table of Contents</summary> <ul> <li><a href="#%EF%B8%8Fdesign-scalable-ml-experiments-with-dvc-and-ray">🛠️ Design Scalable ML Experiments with DVC and Ray</a> <ul> <li><a href="#1---technical-challenges-of-running-dvc-in-a-distributed-ray-cluster">1 - Technical challenges of running DVC in a distributed Ray Cluster</a></li> <li><a href="#2---overview-of-the-solution-design">2 - Overview of the Solution Design</a></li> <li><a href="#3---discuss-the-solution-design">3 - Discuss the solution design</a> <ul> <li><a href="#%EF%B8%8Fuse-a-modified-dvclive-logger-to-upload-metrics-to-the-s3">☝️ Use a modified DVCLive logger to upload metrics to the S3</a></li> <li><a href="#%EF%B8%8Fdownload-dvclive-metrics-to-the-dvc-repository-after-the-training-is-complete">☝️ Download DVCLive metrics to the DVC repository after the training is complete</a></li> </ul> </li> </ul> </li> <li><a href="#set-up-and-run-dvc-in-distributed-ray-cluster">🚀 Set Up and Run DVC in Distributed Ray Cluster</a> <ul> <li><a href="#1---prepare-aws-and-dvc-studio-credentials">1 - Prepare <strong>AWS and DVC Studio credentials</strong></a></li> <li><a href="#2---configure-ray-cluster-in-clusteryaml">2 - Configure Ray Cluster in <code>cluster.yaml</code></a> <ul> <li><a href="#set-the-cluster-name-and-auto-scaling-config">Set the cluster name and auto-scaling config</a></li> <li><a href="#set-up-the-docker-image-for-the-head-and-worker-nodes">Set up the Docker image for the head and worker nodes</a></li> <li><a href="#cloud-provider-configuration">Cloud-provider configuration</a></li> <li><a href="#files-or-directories-to-copy-to-the-head-and-worker-nodes">Files or directories to copy to the head and worker nodes</a></li> <li><a href="#additional-commands-to-set-up-nodes">Additional commands to set up nodes</a></li> </ul> </li> <li><a href="#3---start-a-ray-cluster-on-aws">3 - Start a Ray Cluster on AWS</a></li> <li><a 
href="#4---connect-to-the-head-node-and-set-up-credentials">4 - Connect to the Head Node and Set Up Credentials</a> <ul> <li><a href="#connecting-to-the-cluster">Connecting to the Cluster</a></li> <li><a href="#setting-up-git-credentials">Setting Up Git Credentials</a></li> <li><a href="#run-tests-to-check-the-correct-setup">Run tests to check the correct setup</a></li> </ul> </li> <li><a href="#5---run-dvc-pipelines-on-the-remote-ray-cluster">5 - Run DVC Pipelines on the remote Ray Cluster</a></li> <li><a href="#6---commit--push-experiments">6 - Commit & push experiments</a></li> <li><a href="#7---stop-cluster">7 - Stop Cluster</a></li> </ul> </li> <li><a href="#-summing-up-dvc--ray-integration">🎨 Summing Up: DVC + Ray Integration</a></li> <li><a href="#references">References</a></li> </ul> </details> <h2 id="️design-scalable-ml-experiments-with-dvc-and-ray" style="position:relative;">🛠️ Design Scalable ML Experiments with DVC and Ray<a href="#%EF%B8%8Fdesign-scalable-ml-experiments-with-dvc-and-ray" aria-label="️design scalable ml experiments with dvc and ray permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Moving from a local setup to deploying a multi-node Ray Cluster on AWS marks a significant shift, bringing forth a range of challenges that necessitate careful consideration. This section dives deep into these intricacies, shedding light on the hurdles encountered when scaling ML workflows to the cloud. 
We aim to provide a comprehensive analysis of these challenges and introduce refined solutions for a smooth integration of DVC and Ray in distributed environments. Through this exploration, we lay the groundwork for enhancing scalability, efficiency, and seamless operation of ML pipelines on a larger scale.</p> <p><strong>Goals for this section:</strong></p> <ul> <li>Identify and address the technical challenges of running DVC in a distributed Ray cluster.</li> <li>Design an efficient and scalable integration of DVC and Ray in a distributed environment.</li> <li>Propose solutions and best practices for overcoming these challenges.</li> </ul> <h3 id="1---technical-challenges-of-running-dvc-in-a-distributed-ray-cluster" style="position:relative;">1 - Technical challenges of running DVC in a distributed Ray Cluster<a href="#1---technical-challenges-of-running-dvc-in-a-distributed-ray-cluster" aria-label="1 technical challenges of running dvc in a distributed ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Let’s outline the scope of the target solution for the following discussion:</p> <ul> <li>A Ray Cluster can add more worker nodes (auto-scaling) on AWS EC2.</li> <li>All jobs are executed only on worker nodes (not on the head node) in Docker containers.</li> <li>The user runs DVC pipelines and commits results on the head node (connected by SSH).</li> <li>During the training, the user should be able to track metrics updated in live mode.</li> <li>Data and models are stored in AWS S3.</li> <li>Code and metadata are versioned with Git.</li> 
</ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/80aeeefacbcd205d9aa91fe53aa52a15/39600/2-challenges.png" alt="Challenges" title="Challenges" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Challenges of running DVC in a distributed Ray Cluster</em></p> <p>Let's review each challenge and its proposed solution:</p> <ol> <li><strong>Auto-Scaling Worker Nodes</strong>: <ul> <li>Challenge: Integrating with Ray's auto-scaling feature so that worker nodes are added or removed dynamically based on workload demand.</li> <li>Solution: Use Ray's built-in auto-scaling functionality, which adds and removes worker nodes as needed.</li> </ul> </li> <li><strong>Execution on Worker Nodes Only</strong>: <ul> <li>Challenge: Ensuring that all jobs, including DVC pipelines and Ray tasks, are executed exclusively on worker nodes to optimize resource utilization. In particular, DVC environment variables must be propagated to all worker nodes.</li> <li>Solution: Configure the Ray cluster to execute all tasks and jobs exclusively on worker nodes. Monitor the head node's load and use Ray's capabilities to distribute tasks evenly across the worker nodes.</li> </ul> </li> <li><strong>Live Metrics Tracking During Training</strong>: <ul> <li>Challenge: Tracking real-time metrics during model training on distributed worker nodes with <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>.</li> <li>Solution: Use DVCLive, a lightweight library compatible with DVC, to track real-time metrics during training sessions. Set up the pipeline to use DVCLive on the rank 0 worker only (as discussed above). 
Ensure that DVCLive, running on the rank 0 worker, has access to the <a href="https://dvc.org/doc/user-guide/env" target="_blank" rel="nofollow noopener noreferrer">DVC environment variables</a>, including <code>DVC_STUDIO_TOKEN</code>, to log metrics to DVC Studio.</li> </ul> </li> <li><strong>Synchronize DVC pipeline artifacts with the head node.</strong> <ul> <li>Challenge: Ensuring that artifacts generated by DVC pipelines on worker nodes are consistently and efficiently synchronized back to the head node, where they can be versioned and committed to Git and DVC remote storage.</li> <li>Solution: Setup <ul> <li><strong>From Worker to S3</strong>: Set up Ray to use an AWS S3 bucket as a persistent storage to sync artifacts and checkpoints.</li> <li><strong>From S3 to Head Node</strong>: After the distributed pipeline is complete, pull the required artifacts and a model from the persistent storage on S3 to the project repository on the head node.</li> </ul> </li> </ul> </li> </ol> <h3 id="2---overview-of-the-solution-design" style="position:relative;">2 - Overview of the Solution Design<a href="#2---overview-of-the-solution-design" aria-label="2 overview of the solution design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Here is a diagram that depicts the proposed solution:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ab0b797760f194e14bb4046eb98dff40/39600/3-solution-design-2.png" 
alt="Solution Design for DVC with Ray in Clouds" title="Solution Design for DVC with Ray in Clouds" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Solution Design for DVC with Ray in Clouds</em></p> <p>The diagram illustrates the integration of DVC (Data Version Control) and Ray in a cloud-based environment, specifically using AWS EC2 instances. Let's break down the key components and steps outlined in the diagram.</p> <ol> <li>Package project & Provision Ray Cluster: Provision the Ray cluster on AWS EC2 instances and make the project available on its nodes before running experiments. There are a few ways to do this: <ul> <li>Set up <code>cluster.yaml</code> to copy files and directories from the local machine to the head and worker nodes.</li> <li>Pull the code and dependencies from the Git repository or S3 bucket.</li> </ul> </li> <li>Run <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>: In a Ray cluster, the head node coordinates tasks and manages resources. It initiates the execution of parallel tasks on worker nodes. Connect to the head node of the Ray cluster, navigate to the project directory, and run <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>.</li> <li>Publish Live Metrics to Studio: <ul> <li>During the execution of <code>train.py</code>, DVCLive logs metrics and parameters only on the rank 0 worker to avoid duplication.</li> <li>DataChain Studio visualizes metric updates in live mode.</li> </ul> </li> <li>Push DVCLive logs from a Worker Node to S3: The current version of DVCLive writes metrics and artifacts to the local filesystem of the rank 0 worker. 
To make them available in the project repository on the head node after the experiment is complete, a few modifications were made: <ul> <li>Use <code>DVCLiveRayLogger</code> in place of <a href="https://dvc.org/doc/dvclive/live" target="_blank" rel="nofollow noopener noreferrer">Live</a>. It is a subclass of Live extended with functionality to store metrics in S3.</li> <li>The overridden <code>next_step()</code> method uploads the <code>results/dvclive</code> directory to the S3 bucket (<code>s3://cse-cloud-version/tutorial-mnist-dvc-ray/</code>) every epoch.</li> </ul> </li> <li>Pull DVCLive logs from S3 to the Head Node after completing the experiment.</li> <li>Commit & Push the DVC experiment artifacts and metadata updates.</li> </ol> <h3 id="3---discuss-the-solution-design" style="position:relative;">3 - Discuss the solution design<a href="#3---discuss-the-solution-design" aria-label="3 discuss the solution design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Let’s summarise the changes made to the scripts so that they run in a distributed Ray cluster in the cloud:</p> <ul> <li>Use a modified DVCLive logger to upload metrics to the S3 bucket every epoch.</li> <li>Download DVCLive metrics to the DVC repository after the training is complete.</li> </ul> <h4 id="️use-a-modified-dvclive-logger-to-upload-metrics-to-the-s3" style="position:relative;">☝️ Use a modified DVCLive logger to upload metrics to the S3<a href="#%EF%B8%8Fuse-a-modified-dvclive-logger-to-upload-metrics-to-the-s3" aria-label="️use a modified dvclive logger to upload metrics to the s3 permalink" class="anchor after"><svg aria-hidden="true" 
height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>A modified <code>DVCLiveRayLogger</code> inherits from <code>Live</code> and introduces the ability to push DVCLive metrics directly to an S3 bucket. This is necessary because the code is executed on remote workers, and DVCLive can’t log metrics and artifacts directly to the DVC repository.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live


<span class="token keyword">class</span> <span class="token class-name">DVCLiveRayLogger</span><span class="token punctuation">(</span>Live<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">__init__</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> bucket_name<span class="token punctuation">,</span> s3_directory<span class="token punctuation">,</span> <span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>__init__<span class="token punctuation">(</span><span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span>
        self<span class="token punctuation">.</span>bucket_name <span class="token operator">=</span> bucket_name
        self<span class="token punctuation">.</span>s3_directory <span class="token operator">=</span> s3_directory

    <span class="token keyword">def</span> <span class="token function">next_step</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> <span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token builtin">super</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token operator">*</span>args<span class="token punctuation">,</span> <span class="token operator">**</span>kwargs<span class="token punctuation">)</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"\nDVCLiveLogger: Push DVCLive metrics to S3"</span><span class="token punctuation">)</span>
        upload_to_s3<span class="token punctuation">(</span>self<span class="token punctuation">.</span><span class="token builtin">dir</span><span class="token punctuation">,</span> self<span class="token punctuation">.</span>bucket_name<span class="token punctuation">,</span> self<span class="token punctuation">.</span>s3_directory<span class="token punctuation">)</span></code></pre></div> <ul> <li>By pushing the DVCLive directory to S3, teams can easily share, access, and analyze training progress from anywhere without relying on local file systems.</li> </ul> <h4 id="️download-dvclive-metrics-to-the-dvc-repository-after-the-training-is-complete" style="position:relative;">☝️ Download DVCLive metrics to the DVC repository after the training is complete<a href="#%EF%B8%8Fdownload-dvclive-metrics-to-the-dvc-repository-after-the-training-is-complete" aria-label="️download dvclive 
metrics to the dvc repository after the training is complete permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>A <code>Live</code> instance created from <code>DVCLiveRayLogger</code> behaves the same way as the original DVCLive. There are a few changes in the configuration:</p> <ul> <li>Set <code>dir="results/dvclive"</code> to ensure that after the training DVC will correctly resolve paths of logged metrics and artifacts.</li> <li>Set <code>bucket_name</code> and <code>s3_directory</code> to save live metrics and artifacts in S3.</li> </ul> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">train_func_per_worker</span><span class="token punctuation">(</span>config<span class="token punctuation">:</span> Dict<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>

    <span class="token comment"># [3] Set up Live object for DVCLive</span>
    live <span class="token operator">=</span> <span class="token boolean">None</span>
    <span class="token keyword">if</span> worker_rank <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        <span class="token comment"># Initialize DVC Live</span>
        <span class="token keyword">from</span> src<span class="token punctuation">.</span>live <span class="token keyword">import</span> DVCLiveRayLogger <span class="token keyword">as</span> Live
        live <span class="token operator">=</span> Live<span class="token punctuation">(</span>
            <span class="token builtin">dir</span><span class="token operator">=</span><span class="token string">"results/dvclive"</span><span class="token punctuation">,</span>
            save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">,</span>
            bucket_name<span class="token operator">=</span><span class="token string">"cse-cloud-version"</span><span class="token punctuation">,</span>
            s3_directory<span class="token operator">=</span><span class="token string">"tutorial-mnist-dvc-ray/dvclive"</span><span class="token punctuation">,</span>
        <span class="token punctuation">)</span>


<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>params<span class="token punctuation">:</span> <span class="token builtin">dict</span><span class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span>
    <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>

    <span class="token comment"># Pull DVCLive logs from S3</span>
    s3_directory <span class="token operator">=</span> <span class="token string">"tutorial-mnist-dvc-ray/dvclive"</span>
    download_from_s3<span class="token punctuation">(</span>bucket_name<span class="token punctuation">,</span> s3_directory<span class="token punctuation">,</span> <span class="token string">'results/dvclive/'</span><span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
    train<span class="token punctuation">(</span>params<span class="token punctuation">)</span></code></pre></div> <ul> <li>At every training epoch, <code>live.next_step()</code> pushes the <code>results/dvclive</code> directory to the S3 bucket.</li> <li>After the training, use <code>download_from_s3()</code> to download DVCLive metrics to the <code>results/dvclive/</code> directory in the DVC repository.</li> </ul> <h2 id="set-up-and-run-dvc-in-distributed-ray-cluster" style="position:relative;">🚀 Set Up and Run DVC in Distributed Ray Cluster<a href="#set-up-and-run-dvc-in-distributed-ray-cluster" aria-label="set up and run dvc in distributed ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <blockquote> <p>💡 Note: Navigate to the <code>cloud</code> branch in the repository</p> </blockquote> <p>This section of the tutorial provides a step-by-step guide on how to set up and run a DVC pipeline on a Ray cluster hosted on AWS. 
The integration of DVC with Ray on AWS allows for scaling machine learning workflows, leveraging cloud resources for distributed processing.</p> <p><strong>Goals for this section:</strong></p> <ul> <li>Guide you through the steps to set up and run the example on a Ray cluster hosted on AWS.</li> <li>Explain specific solutions and best practices.</li> </ul> <h3 id="1---prepare-aws-and-dvc-studio-credentials" style="position:relative;">1 - Prepare <strong>AWS and DVC Studio credentials</strong><a href="#1---prepare-aws-and-dvc-studio-credentials" aria-label="1 prepare aws and dvc studio credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This example uses a simple AWS access configuration. Prepare AWS credentials for use with Ray (or any other application that requires AWS access) and store them in a specific file (<code>~/.aws/ray-credentials</code>) on a local machine. In the next step, you’ll configure Ray to use this file.</p> <p>For example, use the following CLI script to store AWS secrets in <code>~/.aws/ray-credentials</code>:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">echo</span> <span class="token string">"[default]
aws_access_key_id = ASIAU7...
aws_secret_access_key = Fdpgl...
aws_session_token = IQoJb3JpZ...
"</span> <span class="token operator">></span> ~/.aws/ray-credentials</code></pre></div> <p>To track metrics with DVC Studio, save your <a href="https://dvc.org/doc/studio/user-guide/account-and-billing#client-access-tokens" target="_blank" rel="nofollow noopener noreferrer">DVC Studio client access token</a> to a <code>.dvc/config.local</code> file. Neither Git nor DVC tracks this file. In the next step, you’ll configure Ray to use this file to provision the head and worker nodes.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">dvc config <span class="token parameter variable">--local</span> studio.token isat_2BlrAu0aileSH<span class="token punctuation">..</span>.</code></pre></div> <h3 id="2---configure-ray-cluster-in-clusteryaml" style="position:relative;">2 - Configure Ray Cluster in <code>cluster.yaml</code><a href="#2---configure-ray-cluster-in-clusteryaml" aria-label="2 configure ray cluster in clusteryaml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To initiate a Ray cluster on AWS, you will use a configuration file named <code>cluster.yaml</code>, which outlines the specifications of your AWS setup, including instance types, the number of nodes, and other settings. The <code>cluster.yaml</code> file is large and heavily commented. 
Let’s highlight only parts specific to the current solution design.</p> <h4 id="set-the-cluster-name-and-auto-scaling-config" style="position:relative;">Set the cluster name and auto-scaling config<a href="#set-the-cluster-name-and-auto-scaling-config" aria-label="set the cluster name and auto scaling config permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">cluster_name</span><span class="token punctuation">:</span> tutorial<span class="token punctuation">-</span>mnist<span class="token punctuation">-</span>dvc<span class="token punctuation">-</span>ray
<span class="token key atrule">max_workers</span><span class="token punctuation">:</span> <span class="token number">2</span>
<span class="token key atrule">upscaling_speed</span><span class="token punctuation">:</span> <span class="token number">1.0</span></code></pre></div> <ul> <li> <p>In the Ray cluster configuration for the <code>tutorial-mnist-dvc-ray</code> cluster, the <code>cluster_name</code> specifies a unique identifier for the cluster, distinguishing it from other clusters you might be running. This name is used in managing and tracking the cluster's resources.</p> </li> <li> <p>The <code>max_workers</code> setting defines the maximum number of worker nodes the cluster can scale up to in addition to the head node. 
It's set to <code>2</code> here, meaning the cluster can run up to two worker nodes concurrently to process tasks.</p> </li> <li> <p>The <code>upscaling_speed</code> parameter controls how quickly the cluster can scale up by adding more worker nodes when there's an increase in load or tasks. Set at <code>1.0</code>, the autoscaler can increase the cluster size by up to 100% of the currently running nodes at each scaling operation.</p> </li> </ul> <h4 id="set-up-the-docker-image-for-the-head-and-worker-nodes" style="position:relative;">Set up the Docker image for the head and worker nodes<a href="#set-up-the-docker-image-for-the-head-and-worker-nodes" aria-label="set up the docker image for the head and worker nodes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>Using Docker enables you to run your distributed applications in a consistent and controlled environment, leveraging Docker's containerization to manage dependencies and system settings across all nodes seamlessly.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">docker</span><span class="token punctuation">:</span>
  <span class="token key atrule">image</span><span class="token punctuation">:</span> <span class="token string">'rayproject/ray-ml@sha256:fa8c69ae055b92bf2f97e22c6a96ea835be60afa69c224d6e1275c3040833d0a'</span>
  <span class="token key atrule">container_name</span><span class="token punctuation">:</span> <span class="token string">'ray_container'</span>
  <span class="token key atrule">pull_before_run</span><span class="token punctuation">:</span> <span class="token boolean important">True</span>
  <span class="token key atrule">run_options</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>ulimit nofile=65536<span class="token punctuation">:</span><span class="token number">65536</span></code></pre></div> <p>This Ray cluster configuration segment specifies Docker settings for running tasks across all nodes:</p> <ul> <li><code>image</code>: The Docker image used for containers on all nodes, identified by its SHA256 digest for consistency.</li> <li><code>container_name</code>: The name for Docker containers, set as <code>ray_container</code>.</li> <li><code>pull_before_run</code>: When <code>True</code>, the image is pulled before the container starts.</li> <li><code>run_options</code>: Extra options passed to <code>docker run</code>; here they raise the open-file limit to 65536.</li> </ul> <h4 id="cloud-provider-configuration" style="position:relative;">Cloud-provider configuration<a href="#cloud-provider-configuration" aria-label="cloud provider configuration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>This Ray cluster configuration outlines the setup for running distributed applications on AWS, specifying both cloud provider settings and instance configurations, including a unique approach for the head node.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">provider</span><span class="token punctuation">:</span>
  <span class="token key atrule">type</span><span class="token punctuation">:</span> aws
  <span class="token key atrule">region</span><span class="token punctuation">:</span> us<span class="token punctuation">-</span>west<span class="token punctuation">-</span><span class="token number">2</span>
  <span class="token key atrule">availability_zone</span><span class="token punctuation">:</span> us<span class="token punctuation">-</span>west<span class="token punctuation">-</span>2a<span class="token punctuation">,</span>us<span class="token punctuation">-</span>west<span class="token punctuation">-</span>2b
  <span class="token key atrule">cache_stopped_nodes</span><span class="token punctuation">:</span> <span class="token boolean important">True</span>
<span class="token key atrule">available_node_types</span><span class="token punctuation">:</span>
  <span class="token key atrule">ray.head.default</span><span class="token punctuation">:</span>
    <span class="token key atrule">resources</span><span class="token punctuation">:</span> <span class="token punctuation">{</span> <span class="token key atrule">'CPU'</span><span class="token punctuation">:</span> <span class="token number">0</span> <span class="token punctuation">}</span>
    <span class="token key atrule">node_config</span><span class="token punctuation">:</span>
      <span class="token key atrule">InstanceType</span><span class="token punctuation">:</span> m5.2xlarge
      <span class="token key atrule">BlockDeviceMappings</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">DeviceName</span><span class="token punctuation">:</span> /dev/sda1
          <span class="token key atrule">Ebs</span><span class="token punctuation">:</span>
            <span class="token key atrule">VolumeSize</span><span class="token punctuation">:</span> <span class="token number">160</span>
            <span class="token key atrule">VolumeType</span><span class="token punctuation">:</span> gp3
  <span class="token key atrule">ray.worker.default</span><span class="token punctuation">:</span>
    <span class="token key atrule">min_workers</span><span class="token punctuation">:</span> <span class="token number">1</span>
    <span class="token key atrule">max_workers</span><span class="token punctuation">:</span> <span class="token number">2</span>
    <span class="token key atrule">resources</span><span class="token punctuation">:</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
    <span class="token key atrule">node_config</span><span class="token punctuation">:</span>
      <span class="token key atrule">InstanceType</span><span class="token punctuation">:</span> m5.2xlarge
      <span class="token key atrule">InstanceMarketOptions</span><span class="token punctuation">:</span>
        <span class="token key atrule">MarketType</span><span class="token punctuation">:</span> spot
      <span class="token key atrule">BlockDeviceMappings</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">DeviceName</span><span class="token punctuation">:</span> /dev/sda1
          <span class="token key atrule">Ebs</span><span class="token punctuation">:</span>
            <span class="token key atrule">VolumeSize</span><span class="token punctuation">:</span> <span class="token number">160</span>
            <span class="token key atrule">VolumeType</span><span class="token punctuation">:</span> gp3</code></pre></div> <p>This configuration establishes a cost-efficient Ray cluster on AWS, pairing an on-demand head node with spot-instance worker nodes to optimize costs and performance:</p> <ul> <li><strong>Head Node</strong> (<code>ray.head.default</code>): Configured to use an <code>m5.2xlarge</code> instance, with a custom block device mapping for increased EBS volume size (160 GB, gp3 type).
The <code>resources</code> for the head node are set to <code>{"CPU": 0}</code>, so Ray does not schedule compute tasks on it.</li> <li><strong>Worker Nodes</strong> (<code>ray.worker.default</code>): Also set to use <code>m5.2xlarge</code> instances with the same storage configuration. Worker nodes run on spot instances to reduce costs, and their CPU and GPU resources are auto-detected, allowing them to be allocated for computational tasks. The configuration supports scaling between 1 and 2 worker nodes dynamically.</li> <li>Setting <code>{"CPU": 0}</code> for the head node is a deliberate choice: it keeps the head node free to manage the cluster's operations, including task scheduling and resource allocation.</li> </ul> <h4 id="files-or-directories-to-copy-to-the-head-and-worker-nodes" style="position:relative;">Files or directories to copy to the head and worker nodes<a href="#files-or-directories-to-copy-to-the-head-and-worker-nodes" aria-label="files or directories to copy to the head and worker nodes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>The <code>file_mounts</code> configuration replicates a consistent working environment across the cluster by copying the necessary code, configuration, and credentials to every node.
This setup supports seamless distributed execution of tasks, including data processing, training machine learning models, and interacting with cloud services.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">file_mounts</span><span class="token punctuation">:</span>
  <span class="token punctuation">{</span>
    <span class="token key atrule">'/home/ray/tutorial-mnist-dvc-ray'</span><span class="token punctuation">:</span> <span class="token string">'.'</span><span class="token punctuation">,</span>
    <span class="token key atrule">'/home/ray/tutorial-mnist-dvc-ray/.dvc/config.local'</span><span class="token punctuation">:</span> <span class="token string">'./.dvc/config.local'</span><span class="token punctuation">,</span>
    <span class="token key atrule">'/home/ray/.aws/credentials'</span><span class="token punctuation">:</span> <span class="token string">'~/.aws/ray-credentials'</span>
  <span class="token punctuation">}</span>
<span class="token key atrule">rsync_filter</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token string">'.gitignore'</span></code></pre></div> <ul> <li><code>/home/ray/tutorial-mnist-dvc-ray</code>: This entry maps the current local directory (denoted by <code>"."</code>) to the remote directory <code>/home/ray/tutorial-mnist-dvc-ray</code> on both the head and worker nodes. It transfers the entire project (including the <code>.git</code> directory): code, scripts, and any small data or configuration files needed to execute the pipeline.</li> <li><code>/home/ray/tutorial-mnist-dvc-ray/.dvc/config.local</code>: This entry indicates that the local DVC configuration file, <code>.dvc/config.local</code>, should be explicitly copied to the corresponding path on the remote nodes.
This file includes an access token for DVC Studio and is thus excluded from Git tracking as a security measure. Because the <code>rsync_filter</code> patterns in the configuration omit all Git-ignored files, including data files and the DVC cache, the <code>config.local</code> file must be listed explicitly. This ensures the file is transferred despite the filter, so every node in the cluster keeps access to DVC Studio.</li> <li><code>/home/ray/.aws/credentials</code>: This maps a custom AWS credentials file from the local machine (<code>~/.aws/ray-credentials</code>) to the standard AWS credentials path (<code>/home/ray/.aws/credentials</code>) on the remote nodes. This lets AWS SDKs and CLI tools running on the remote nodes authenticate with AWS services using the provided credentials.</li> </ul> <blockquote> <p>💡 Note: This example uses a simplified approach to configure access to AWS resources and DVC Studio. For a production setup, it's crucial to:</p> <ul> <li>Ensure that sensitive information, especially credentials, is handled securely. Use IAM roles for EC2 instances where possible to avoid copying AWS credentials.</li> <li>Minimize the size of transferred directories to speed up the cluster initialization process.
Consider excluding large datasets or output directories if they're not needed on every node or can be accessed from a shared storage service like Amazon S3.</li> </ul> </blockquote> <h4 id="additional-commands-to-set-up-nodes" style="position:relative;">Additional commands to set up nodes<a href="#additional-commands-to-set-up-nodes" aria-label="additional commands to set up nodes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>The <code>setup_commands</code> section in the Ray cluster configuration outlines a series of shell commands executed on all nodes (both head and worker nodes) during their initialization phase. 
These commands prepare the nodes with the software and libraries your application needs.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">setup_commands</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> pip install <span class="token punctuation">-</span>U ray<span class="token punctuation">[</span>default<span class="token punctuation">]</span>
  <span class="token punctuation">-</span> pip install dvc<span class="token punctuation">[</span>s3<span class="token punctuation">]</span>==3.43.1 dvclive==3.41.1
  <span class="token punctuation">-</span> pip install <span class="token punctuation">-</span>U pyOpenSSL==24.0.0</code></pre></div> <p>Here’s a breakdown:</p> <ul> <li><code>pip install -U ray[default]</code>: Upgrades Ray and installs its <code>default</code> extras, which include the dashboard and cluster launcher.</li> <li><code>pip install dvc[s3]==3.43.1 dvclive==3.41.1</code>: Installs specific versions of DVC (Data Version Control) with S3 support and DVCLive. Specifying versions ensures consistency in running the tutorial example.</li> <li><code>pip install -U pyOpenSSL==24.0.0</code>: Pins the pyOpenSSL library to version 24.0.0 after the DVC installation.
This is a specific requirement for this example to ensure the consistency of the Python dependencies.</li> </ul> <h3 id="3---start-a-ray-cluster-on-aws" style="position:relative;">3 - Start a Ray Cluster on AWS<a href="#3---start-a-ray-cluster-on-aws" aria-label="3 start a ray cluster on aws permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Run the following command to start your Ray cluster as defined in your <code>cluster.yaml</code> file:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ray up cluster.yaml</code></pre></div> <p>You can access the Ray dashboard once your Ray cluster is running. 
This dashboard provides a real-time view of your cluster's status, including resource utilization, task progress, and logs.</p> <p>To open the Ray dashboard, use:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ray dashboard cluster.yaml</code></pre></div> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ef0f001747c181994c94bef89591b2de/39600/4-dashboard.png" alt="Ray Dashboard" title="Ray Dashboard" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ray Dashboard</em></p> <h3 id="4---connect-to-the-head-node-and-set-up-credentials" style="position:relative;">4 - Connect to the Head Node and Set Up Credentials<a href="#4---connect-to-the-head-node-and-set-up-credentials" aria-label="4 connect to the head node and set up credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Once your Ray cluster is provisioned and all nodes are correctly set up with the necessary software, the next step involves connecting to the head node to configure access credentials for GitHub, Amazon S3, and other services like DVC Studio. 
These credentials are essential for version control, data storage, and continuous integration and deployment (CI/CD) processes.</p> <h4 id="connecting-to-the-cluster" style="position:relative;">Connecting to the Cluster<a href="#connecting-to-the-cluster" aria-label="connecting to the cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>To open a secure connection to the head node of your Ray cluster, run the following command. It uses the cluster configuration defined in <code>cluster.yaml</code> and gives you a terminal session on the head node:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Connect to cluster</span>
ray attach cluster.yaml</code></pre></div> <h4 id="setting-up-git-credentials" style="position:relative;">Setting Up Git Credentials<a href="#setting-up-git-credentials" aria-label="setting up git credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>Once connected to the head node, configure Git with your username and email to enable commits to your repositories.
Additionally, an access token can be set up for GitHub to securely push and pull without using a password. Replace <code>&lt;your_username&gt;</code> with your GitHub username, <code>&lt;your_email&gt;</code> with the email associated with your GitHub account, and <code>&lt;your_github_pat&gt;</code> with your GitHub Personal Access Token (PAT).</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> config <span class="token parameter variable">--global</span> user.name <span class="token string">"&lt;your_username&gt;"</span>
<span class="token function">git</span> config <span class="token parameter variable">--global</span> user.email <span class="token string">"&lt;your_email&gt;"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">GITHUB_ACCESS_TOKEN</span><span class="token operator">=</span><span class="token operator">&lt;</span>your_github_pat<span class="token operator">&gt;</span></code></pre></div> <p>Use the access token to update the repository's remote URL for authentication.
This step assumes you have cloned the repository and are inside the repository directory.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> remote set-url origin https://your_username:<span class="token variable">${GITHUB_ACCESS_TOKEN}</span>@github.com/your_username/tutorial-mnist-dvc-ray.git</code></pre></div> <h4 id="run-tests-to-check-the-correct-setup" style="position:relative;">Run tests to check the correct setup<a href="#run-tests-to-check-the-correct-setup" aria-label="run tests to check the correct setup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>Run a test script to verify that AWS credentials are correctly set up on the cluster for accessing S3 services.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">PYTHONPATH</span><span class="token operator">=</span><span class="token environment constant">$PWD</span>
python src/test_scripts/test_s3.py</code></pre></div> <blockquote> <p>The example scripts are inside the <code>~/tutorial-mnist-dvc-ray</code> directory.</p> </blockquote> <h3 id="5---run-dvc-pipelines-on-the-remote-ray-cluster" style="position:relative;">5 - Run DVC Pipelines on the remote Ray Cluster<a href="#5---run-dvc-pipelines-on-the-remote-ray-cluster" aria-label="5 run dvc pipelines on the remote ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16"
width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Navigate to the <code>tutorial-mnist-dvc-ray</code> directory and run a new experiment:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">PYTHONPATH</span><span class="token operator">=</span><span class="token environment constant">$PWD</span>
dvc exp run <span class="token parameter variable">-f</span></code></pre></div> <p>This starts the pipeline, running the <code>tune</code> and <code>train</code> stages as defined in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file and distributing the computation with Ray. The <code>-f</code> flag forces the stages to run even if nothing has changed.</p> <p>You can watch live updates of metrics and plots in <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/28d12948c82876a84b7ae984c2a59f6d/39600/5-dvc-studio.png" alt="Live Metrics Tracking with DVC Studio" title="Live Metrics Tracking with DVC Studio" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Live Metrics Tracking with DVC Studio</em></p> <p>This setup with DVC and DVCLive offers a structured approach to monitoring model performance through metrics tracking and visualization.
It aids in understanding the model's behavior over training, facilitating decisions on model adjustments or improvements. Moreover, after the experiment is complete, you may change the plot template, add new plots, or customize the existing ones to suit your specific requirements if needed.</p> <h3 id="6---commit--push-experiments" style="position:relative;">6 - Commit & push experiments<a href="#6---commit--push-experiments" aria-label="6 commit push experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Once you've completed an experiment and are ready to share or preserve the results, DVC provides a seamless workflow to list, select, and commit the outcomes of your experiments. 
Here’s how to manage and share your experiment results using DVC and Git.</p> <p>Use <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a> to get an overview of all experiments, including their metrics and parameters.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token punctuation">(</span>base<span class="token punctuation">)</span> ray@ip-172-31-41-217:~/tutorial-mnist-dvc-ray$ dvc exp show
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────<span class="token operator">></span>
  Experiment                 Created    loss      accuracy   step   tune.run_tune   tune.epoch_size   tune.test_size   tune.results_dir<span class="token operator">></span>
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────<span class="token operator">></span>
  workspace                  -          <span class="token number">0.38723</span>   <span class="token number">0.8602</span>     <span class="token number">4</span>      True            <span class="token number">512</span>               <span class="token number">256</span>              results/tune    <span class="token operator">></span>
  cloud-remote               02:17 PM   <span class="token number">0.3951</span>    <span class="token number">0.8542</span>     <span class="token number">4</span>      True            <span class="token number">512</span>               <span class="token number">256</span>              results/tune    <span class="token operator">></span>
  ├── dbcdc38 <span class="token punctuation">[</span>broad-teas<span class="token punctuation">]</span>   06:22 AM   <span class="token number">0.38723</span>   <span class="token number">0.8602</span>     <span class="token number">4</span>      True            <span class="token number">512</span>               <span class="token number">256</span>              results/tune    <span class="token operator">></span>
  └── 11e273e <span class="token punctuation">[</span>metal-sick<span class="token punctuation">]</span>   06:21 AM   <span class="token number">0.3951</span>    <span class="token number">0.8542</span>     <span class="token number">4</span>      True            <span class="token number">512</span>               <span class="token number">256</span>              results/tune    <span class="token operator">></span>
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────<span class="token operator">></span>
<span class="token punctuation">(</span>END<span class="token punctuation">)</span></code></pre></div> <p>After identifying the successful experiment (e.g., <code>broad-teas</code>), you can use DVC to create a new branch for this experiment, facilitating version control and collaboration.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">dvc exp branch broad-teas</code></pre></div> <p>Next, push the newly created branch to your remote Git repository and upload artifacts to the DVC remote storage.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> checkout broad-teas-branch
<span class="token function">git</span> push origin broad-teas-branch
dvc push</code></pre></div> <h3 id="7---stop-cluster" style="position:relative;">7 - Stop Cluster<a href="#7---stop-cluster" aria-label="7 stop cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Turn off the remote cluster when not in use to save money and reduce environmental impact!</p> <div class="gatsby-highlight"
data-language="bash"><pre class="language-bash"><code class="language-bash">ray down cluster.yaml</code></pre></div> <h2 id="-summing-up-dvc--ray-integration" style="position:relative;">🎨 Summing Up: DVC + Ray Integration<a href="#-summing-up-dvc--ray-integration" aria-label=" summing up dvc ray integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The DVC + Ray integration presents a comprehensive solution to the challenges of running machine learning experiments at scale. By addressing specific issues related to auto-scaling, execution optimization, live metrics tracking, and data synchronization, this setup ensures that machine learning teams can focus on innovation and experimentation backed by a robust, scalable, and efficient infrastructure.</p> <p>Integrating DVC with Ray combines the best data management and distributed computing for machine learning projects. 
Here's a simplified overview of what we covered:</p> <ol> <li><strong>Set Up Ray Cluster</strong>: Configured a Ray cluster to run on AWS, utilizing Docker for consistent environments and specifying node types for resource optimization.</li> <li><strong>Node Provisioning</strong>: Automated the setup of head and worker nodes for a scalable ML experiment environment.</li> <li><strong>Artifact Sync</strong>: Ensured DVC pipeline artifacts were synchronized across the cluster, keeping data and models consistent.</li> <li><strong>Manage Experiments with DVC Studio</strong>: Demonstrated how to use DVC, DVCLive, and DVC Studio for metrics tracking, artifact versioning, and experiment management.</li> <li><strong>Commit and Share Results</strong>: Highlighted the process of committing experiment results and pushing them to a repository for collaboration and reproducibility.</li> </ol> <p><strong>Key Takeaways</strong>:</p> <ul> <li><strong>Scalability</strong>: Ray and AWS offer a flexible and scalable setup for ML experiments.</li> <li><strong>Reproducibility</strong>: DVC adds data version control, enhancing experiment reproducibility.</li> <li><strong>Automation</strong>: The integration shows how to automate the ML workflow, from setup to experiment tracking.</li> <li><strong>Collaboration</strong>: Using Git and DVC supports effective team collaboration on ML projects.</li> </ul> <blockquote> <p>💡 Did you find this tutorial interesting? Please leave your comments and share your experience with DVC and Ray!
Join us on <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> 🙌</p> </blockquote> <h2 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a href="https://dvc.org/doc/studio/user-guide/experiments/explore-ml-experiments" target="_blank" rel="nofollow noopener noreferrer">DVC Studio: Explore ML Experiments</a></li> <li><a href="https://docs.ray.io/en/latest/ray-overview/getting-started.html" target="_blank" rel="nofollow noopener noreferrer">Ray docs: Getting Started</a></li> <li><a href="https://www.anyscale.com/blog/ray-common-production-challenges-for-generative-ai-infrastructure" target="_blank" rel="nofollow noopener noreferrer">How Ray solves common production challenges for Generative AI infrastructure</a></li> <li><a href="https://medium.com/samsara-engineering/building-a-modern-machine-learning-platform-with-ray-eb0271f9cbcf" target="_blank" rel="nofollow noopener noreferrer">Building a Modern Machine Learning Platform with Ray</a></li> </ul>https://dvc.org/blog/dvc-rayhttps://dvc.org/blog/dvc-rayTue, 12 Mar 2024 00:00:00 GMT<p>Training models at the scale of the Gemini or GPT-4 models requires advanced tools that manage complexity while ensuring efficiency. This tutorial explores how Data Version Control (DVC) can be a game-changer for ambitious projects. 
DVC simplifies AI development by automating pipelines, managing versions, and tracking experiments while embracing GitOps for reproducibility. It excels in both local and cloud environments for traditional ML workflows. However, the rise of Generative AI and complex deep learning projects demands scalable, distributed training solutions.</p> <p>This tutorial is divided into two parts. Part 1 sets the foundation for scalable and efficient machine learning workflows by leveraging Ray’s distributed computing capabilities and DVC’s data version control.</p> <p>In <a href="https://dvc.ai/blog/dvc-ray-part-2" target="_blank" rel="nofollow noopener noreferrer">Part 2</a>, we extend the solution to a Ray Cluster on AWS, demonstrating how to adapt the setup for cloud-based distributed computing. This involves configuring AWS resources, deploying Ray clusters in the cloud, and running DVC-managed pipelines at scale.</p> <blockquote> <p>This guide is tailored for ML Engineers and Team Leads in AI projects who aim to speed up training, optimize resources, and ensure reproducibility across distributed environments. I am looking forward to hearing your feedback and improvements! 🙌</p> </blockquote> <blockquote> <p>We would like to express our gratitude to <a href="https://www.linkedin.com/in/schuh/" target="_blank" rel="nofollow noopener noreferrer">Andreas Schuh</a> from <a href="https://www.heartflow.com/" target="_blank" rel="nofollow noopener noreferrer">HeartFlow</a> for his contribution to this solution and for providing ideas and feedback for the blog posts. 
🤝</p> </blockquote> <details> <summary>Table Of Contents</summary> <ul> <li><a href="#why-dvc-and-ray">Why DVC and Ray?</a></li> <li><a href="#tutorial-scope">Tutorial Scope</a> <ul> <li><a href="#high-level-solution-design">High-level solution design</a></li> <li><a href="#prerequisites">Prerequisites</a></li> </ul> </li> <li><a href="#-installation">👩‍💻 Installation</a></li> <li><a href="#-get-started-with-ray">⭐ Get Started with Ray</a> <ul> <li><a href="#1---overview-of-the-ray-framework">1 - Overview of the Ray Framework</a></li> <li><a href="#2---start-a-ray-cluster">2 - Start a Ray Cluster</a></li> <li><a href="#3---run-a-test-script-on-the-ray-cluster">3 - Run a test script on the Ray Cluster</a></li> </ul> </li> <li><a href="#%EF%B8%8F-run-dvc-pipeline-on-a-ray-cluster">🏃‍♂️ Run DVC Pipeline on a Ray Cluster</a> <ul> <li><a href="#1---design-solution-for-dvc--ray">1 - Design Solution for DVC + Ray</a></li> <li><a href="#2---create-a-dvc-pipeline">2 - Create a DVC pipeline</a> <ul> <li><a href="#tune-stage">Tune Stage</a></li> <li><a href="#train-stage">Train Stage</a></li> </ul> </li> <li><a href="#3---run-dvc-pipelines-on-ray-cluster">3 - Run DVC pipelines on Ray Cluster</a></li> </ul> </li> <li><a href="#-discuss-the-solution-design">💬 Discuss the Solution Design</a> <ul> <li><a href="#%EF%B8%8F-use-dvc-to-run-scripts-calling-ray-api">☝️ Use DVC to run scripts calling Ray API</a></li> <li><a href="#%EF%B8%8F-persist-dvc-stage-outputs-to-keep-them-available-for-downstream-stages-in-case-of-failure">☝️ Persist DVC stage outputs to keep them available for downstream stages in case of failure</a></li> <li><a href="#%EF%B8%8F-use-dvclive-to-track-live-metrics-updates-with-dvc-studio-and-dvc-extension-for-vs-code">☝️ <strong>Use DVCLive to track live metrics updates with DVC Studio and DVC Extension for VS Code</strong></a></li> <li><a href="#%EF%B8%8F-propagate-dvc-environment-variables-to-worker-nodes">☝️ Propagate DVC environment variables to Worker 
nodes</a></li> <li><a href="#%EF%B8%8F-copy-the-modelpth-file-from-the-ray-trial-folder-to-the-dvc-project-repository">☝️ Copy the <code>model.pth</code> file from the Ray Trial folder to the DVC project repository</a></li> </ul> </li> <li><a href="#-summing-up-dvc--ray-integration">🎨 Summing Up: DVC + Ray Integration</a> <ul> <li><a href="#key-takeaways">Key Takeaways</a></li> <li><a href="#looking-ahead-to-part-2">Looking Ahead to Part 2</a></li> </ul> </li> <li><a href="#references">References</a></li> </ul> </details> <h2 id="why-dvc-and-ray" style="position:relative;">Why DVC and Ray?<a href="#why-dvc-and-ray" aria-label="why dvc and ray permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> is an open-source tool that brings GitOps and reproducibility to data management, ML experiments, and model development. It connects versioned data sources and code with pipelines, tracks experiments, and registers models — all based on GitOps principles.</p> <p><a href="https://www.ray.io/" target="_blank" rel="nofollow noopener noreferrer">Ray</a> is an open-source unified computing framework that makes scaling AI and Python workloads easy — from reinforcement learning to deep learning to tuning and model serving. 
Ray makes it a breeze to scale your compute-intensive tasks from a single machine to a massive cluster without losing your mind.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d89dbe2bc88dcfeb76d8cde4662ce349/39600/2-dvc-ray-distributed-ml.png" alt="DVC + Ray for distributed ML" title="DVC + Ray for distributed ML" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>DVC and Ray make your ML projects more manageable and prepare them to tackle the challenges of tomorrow’s AI-driven landscape. Let’s explore this dynamic duo and unlock new potential in your MLOps journey!</p> <blockquote> <p>💡 <strong>Want to learn more about DVC?</strong></p> <p>Join our online course about DVC: <a href="https://learn.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Iterative Tools for Data Scientists & Analysts course</a>!</p> </blockquote> <h2 id="tutorial-scope" style="position:relative;">Tutorial Scope<a href="#tutorial-scope" aria-label="tutorial scope permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This tutorial guides you through creating automated, scalable, and distributed ML pipelines using DVC (Data Version Control) and Ray. We start by configuring the Ray Cluster for local and cloud environments. Next, we discuss the challenges of running DVC in distributed environments. Finally, we run a few examples that use DVC and Ray. 
By the end of the tutorial, you will be able to design, run, and manage ML pipelines that are distributed over multiple nodes and tracked through version control.</p> <p>For <strong>DVC users</strong>, this tutorial offers several advantages:</p> <ul> <li>Bring Distributed Computing Efficiency to DVC projects</li> <li>Easy use of AWS Cloud for Development and Production workflows</li> <li>Enable automated pipelines and data versioning in ML projects with Ray</li> </ul> <p>For <strong>Ray users</strong>, this tutorial aims to highlight the benefits of integrating DVC:</p> <ul> <li>Enhance Model Training Reproducibility with DVC’s data versioning capabilities</li> <li>Streamline ML Pipeline Management through DVC’s structured approach</li> <li>Facilitate Efficient Collaboration among teams by leveraging DVC for shared data and model management</li> </ul> <h3 id="high-level-solution-design" style="position:relative;">High-level solution design<a href="#high-level-solution-design" aria-label="high level solution design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Let’s review the high-level design of our target solution with DVC and Ray.</p> <ol> <li>Users can manage the Ray Cluster and run DVC pipelines from a “local” environment.</li> <li>Ray distributes workloads across multiple workers and can auto-scale cluster nodes.</li> <li>During training, DVCLive logs live updates of metrics and parameters to DVC Studio.</li> <li>DVC utilizes S3 to sync state between the Worker and Head nodes.</li> <li>DVC uses remote storage (AWS S3) to manage data and model 
artifacts.</li> <li>Users commit the results of the experiment to Git and DVC Remote Storage.</li> </ol> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/878eabd6bb9ecbcac34ceddabccf71f2/39600/3-solution-design.png" alt="Solution Design" title="Solution Design" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>High-Level Solution Design</em></p> <h3 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We expect that you:</p> <ul> <li>Have some experience with Machine Learning or Data Engineering pipelines</li> <li>Are familiar with DVC</li> </ul> <p>To follow this tutorial, you’ll need the following tools:</p> <ul> <li>Git</li> <li>Python 3.11 or above</li> <li>AWS CLI (if you want to run pipelines in AWS)</li> </ul> <h2 id="-installation" style="position:relative;">👩‍💻 Installation<a href="#-installation" aria-label=" installation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 
6 13 6z"></path></svg> </a></h2> <p>Creating an ML pipeline that runs distributed tasks is a powerful way to manage and scale your machine learning workflows. With DVC, we can efficiently orchestrate our pipeline stages and handle experiment outputs.</p> <p>To clone the example repository, you can follow these steps:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> clone https://github.com/iterative/tutorial-mnist-dvc-ray.git <span class="token builtin class-name">cd</span> tutorial-mnist-dvc-ray</code></pre></div> <p>Install Python dependencies:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">python3 <span class="token parameter variable">-m</span> venv .venv <span class="token builtin class-name">source</span> .venv/bin/activate pip <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements.txt <span class="token builtin class-name">export</span> <span class="token assign-left variable">PYTHONPATH</span><span class="token operator">=</span><span class="token environment constant">$PWD</span></code></pre></div> <h2 id="-get-started-with-ray" style="position:relative;">⭐ Get Started with Ray<a href="#-get-started-with-ray" aria-label=" get started with ray permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="1---overview-of-the-ray-framework" style="position:relative;">1 - Overview of the Ray Framework<a href="#1---overview-of-the-ray-framework" 
aria-label="1 overview of the ray framework permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://docs.ray.io/en/latest/ray-overview/index.html" target="_blank" rel="nofollow noopener noreferrer">Ray</a> is a framework for scaling AI and Python applications. For AI and ML applications, Ray helps to scale jobs without needing infrastructure expertise:</p> <ul> <li>Efficiently parallelize and distribute ML workloads across multiple nodes and GPUs.</li> <li>Leverage the ML ecosystem with native and extensible integrations.</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8f7a888d791d71b3524d7c72335a5efd/39600/4-ray-stack.png" alt="Stack of Ray libraries - a unified toolkit for ML workloads" title="Stack of Ray libraries - a unified toolkit for ML workloads" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Stack of Ray libraries - A Unified Toolkit For ML Workloads (<a href="https://docs.ray.io/en/latest/ray-overview/index.html" target="_blank" rel="nofollow noopener noreferrer">Ray Docs</a>)</em></p> <p>In this tutorial, we work with Ray Clusters and Ray AI Libraries (Ray Tune and Ray Train). 
<a href="https://docs.ray.io/en/latest/cluster/getting-started.html" target="_blank" rel="nofollow noopener noreferrer">Ray Cluster</a> is a set of <a href="https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-worker-nodes" target="_blank" rel="nofollow noopener noreferrer">Worker nodes</a> connected to a common Ray <a href="https://docs.ray.io/en/latest/cluster/key-concepts.html#cluster-head-node" target="_blank" rel="nofollow noopener noreferrer">Head node</a>.</p> <ul> <li>The Head node serves as the central coordination point for the Ray cluster. It manages the cluster’s metadata, maintains the cluster state, and handles task scheduling and management.</li> <li>Worker nodes are the computational workhorses of the Ray cluster. They are responsible for executing tasks and running computations for applications.</li> </ul> <p><img src="https://docs.ray.io/en/latest/_images/ray-cluster.svg" alt="Two nodes Ray Cluster"> <em>A Ray cluster with two worker nodes. Each node runs Ray helper processes to facilitate distributed scheduling and memory management. The head node runs additional control processes (highlighted in blue). Source: <a href="https://docs.ray.io/en/latest/cluster/key-concepts.html#head-node" target="_blank" rel="nofollow noopener noreferrer">Ray Docs</a></em></p> <p>Ray clusters can be fixed-size or autoscale up and down according to the resources requested by applications running on the cluster.</p> <p><a href="https://docs.ray.io/en/latest/tune/index.html" target="_blank" rel="nofollow noopener noreferrer">Ray Tune</a> is a Python Library that automates the hyperparameter tuning process across distributed resources. 
By integrating Ray Tune into the experiment workflow, we can evaluate numerous hyperparameter combinations in parallel, speeding up the search for optimal model configurations.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cf01238905df716a8a7f3855be9739f3/39600/5-ray-tune.png" alt="Distributed tuning with Ray" title="Distributed tuning with Ray" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Distributed tuning with distributed training per trial. Source: <a href="https://docs.ray.io/en/latest/ray-overview/use-cases.html" target="_blank" rel="nofollow noopener noreferrer">Ray Docs</a></em></p> <p><a href="https://docs.ray.io/en/latest/train/train.html" target="_blank" rel="nofollow noopener noreferrer">Ray Train</a> creates a setup to scale model training code from a single machine to a cluster of machines in the cloud and abstracts away the complexities of distributed computing. At a high level of abstraction, it distributes and runs training jobs among worker nodes.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9f2b1034200a057d471da25748e93e17/39600/6-ray-train-overview.png" alt="Ray Train Overview" title="Ray Train Overview" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ray Train Overview. 
Source: <a href="https://docs.ray.io/en/latest/train/overview.html" target="_blank" rel="nofollow noopener noreferrer">Ray Docs</a></em></p> <h3 id="2---start-a-ray-cluster" style="position:relative;">2 - Start a Ray Cluster<a href="#2---start-a-ray-cluster" aria-label="2 start a ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <blockquote> <p>💡 Navigate to the <code>main</code> branch in the repository</p> </blockquote> <p>To start a Ray Cluster, first initiate the Ray head node. The head node is the primary node in the Ray cluster that manages the worker nodes. Since this is a local setup, your machine will act as both the Head and Worker nodes. 
Use the following command:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ray start <span class="token parameter variable">--head</span></code></pre></div> <p>This command starts the Ray cluster with your machine acting as the head node.</p> <p>To monitor and debug Ray, view the dashboard at <a href="http://127.0.0.1:8265/" target="_blank" rel="nofollow noopener noreferrer">http://127.0.0.1:8265/</a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/dd970b852f021bb7250e93c79d760c37/39600/7-ray-dashboard.png" alt="Ray Dashboard" title="Ray Dashboard" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ray Dashboard - Cluster Nodes</em></p> <blockquote> <p>💡 Multi-node Ray clusters are only supported on Linux. You may deploy Windows and OSX clusters for development by setting the environment variable <code>RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1</code>. 
Source: <a href="https://docs.ray.io/en/latest/cluster/getting-started.html" target="_blank" rel="nofollow noopener noreferrer">Ray Clusters Overview</a>.</p> </blockquote> <h3 id="3---run-a-test-script-on-the-ray-cluster" style="position:relative;">3 - Run a test script on the Ray Cluster<a href="#3---run-a-test-script-on-the-ray-cluster" aria-label="3 run a test script on the ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can run a simple test script to ensure your local Ray cluster works correctly. In your project directory, create a file named <strong><code>hello_cluster.py</code></strong> inside the <strong><code>src/test_scripts</code></strong> directory. Add a script to connect to the Ray cluster and print a message. 
Here’s an example script:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> ray

<span class="token decorator annotation punctuation">@ray<span class="token punctuation">.</span>remote</span>
<span class="token keyword">def</span> <span class="token function">hello_world</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">return</span> <span class="token string">"Hello Ray cluster"</span>

<span class="token comment"># Automatically connect to the running Ray cluster.</span>
ray<span class="token punctuation">.</span>init<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>ray<span class="token punctuation">.</span>get<span class="token punctuation">(</span>hello_world<span class="token punctuation">.</span>remote<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre></div> <p>Execute the script using Python. Open your terminal and run:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">python src/test_scripts/hello_cluster.py</code></pre></div> <p>You should see an output similar to this:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token number">2023</span>-11-14 <span class="token number">12</span>:11:17,363 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: <span class="token number">192.168</span>.100.19:6379<span class="token punctuation">..</span>. <span class="token number">2023</span>-11-14 <span class="token number">12</span>:11:17,370 INFO worker.py:1664 -- Connected to Ray cluster. 
View the dashboard at http://127.0.0.1:8265 Hello Ray cluster</code></pre></div> <p>This output indicates that your script has successfully connected to the local Ray cluster and executed the print statement.</p> <h2 id="️-run-dvc-pipeline-on-a-ray-cluster" style="position:relative;">🏃‍♂️ Run DVC Pipeline on a Ray Cluster<a href="#%EF%B8%8F-run-dvc-pipeline-on-a-ray-cluster" aria-label="️ run dvc pipeline on a ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>You have a single-node Ray Cluster at this step on your local machine. Let’s start with the DVC pipeline setup.</p> <p>Goals for this section:</p> <ul> <li>Design a Solution for DVC + Ray.</li> <li>Create a DVC pipeline with two stages: tune and train.</li> <li>Modify DVCLive to sync metrics and parameters with DVC Studio.</li> </ul> <h3 id="1---design-solution-for-dvc--ray" style="position:relative;">1 - Design Solution for DVC + Ray<a href="#1---design-solution-for-dvc--ray" aria-label="1 design solution for dvc ray permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The technical design calls for a structure where ML experiment scripts, managed by DVC, invoke Ray for their 
computation needs. DVC is the orchestrator, invoking the appropriate Ray functions for distributed processing.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6171b601ceccb690ed7d41ba04186227/39600/8-solution-design-local.png" alt="Design POC Solution for DVC + Ray (local)" title="Design POC Solution for DVC + Ray (local)" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Design POC Solution for DVC + Ray (local)</em></p> <p>This diagram outlines the integration of DVC (Data Version Control) with a Ray cluster for running ML experiments in a distributed manner:</p> <ol> <li>DVC initiates the process by running a stage script. The <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> pipeline definition is the blueprint for the ML workflow, defining stages that utilize Ray for hyperparameter tuning and subsequent training stages.</li> <li>Ray Job Submission: The stage script (e.g., <code>src/stages/tune.py</code>) starts a Ray application that submits computation jobs to Ray. The <code>src/stages/tune.py</code> script utilizes Ray Tune’s <code>Tuner</code> class to define and run the hyperparameter tuning trials.</li> <li>Ray Cluster contains a single Head Node where the actual computation occurs. (Note: In the production cluster, Ray runs the jobs distributed across multiple worker nodes). 
Ray saves results for each job (trial) to a local directory in a worker node (outside the DVC project repo).</li> <li>After all jobs complete, the stage script retrieves results from Ray’s trial directories to the DVC project repo (if needed).</li> <li>DVC manages the outputs of the pipeline, ensuring reproducibility and traceability.</li> </ol> <p>The result is a robust framework for conducting and managing ML experiments that are scalable, reproducible, and efficiently optimized. This framework not only streamlines the experimentation process but also simplifies the transition of models from development to production.</p> <h3 id="2---create-a-dvc-pipeline" style="position:relative;">2 - Create a DVC pipeline<a href="#2---create-a-dvc-pipeline" aria-label="2 create a dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In this tutorial, the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file contains only two stages in the ML pipeline: <code>tune</code> and <code>train</code>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bf19a010eb28547e700c40d28eae4b1d/39600/9-dvc-pipeline.png" alt="DVC pipeline" title="DVC pipeline" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC pipeline configuration in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> 
with <code>tune</code> and <code>train</code> stages, and <code>plots</code> sections</em></p> <h4 id="tune-stage" style="position:relative;">Tune Stage<a href="#tune-stage" aria-label="tune stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>This initial stage is responsible for hyperparameter tuning. It uses Ray to distribute the computation involved in this process. The stage executes the Python script <code>tune.py</code>, which optimizes hyperparameters using Ray Tune. The output of this stage is <code>best_params.yaml</code>, which contains the best hyperparameters found during the tuning process.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">tune</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/stages/tune.py <span class="token punctuation">-</span><span class="token punctuation">-</span>config params.yaml <span class="token key atrule">params</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> tune <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> $<span class="token punctuation">{</span>tune.results_dir<span class="token punctuation">}</span>/best_params.yaml<span class="token punctuation">:</span> <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span> <span class="token 
key atrule">persist</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div> <p>Use two specific configuration parameters for the <code>best_params.yaml</code> output:</p> <ul> <li>Set <code>cache: false</code> to instruct DVC not to cache the file but version it with Git.</li> <li>Set <code>persist: true</code> to instruct DVC not to remove the file before reproducing the stage. This is useful for downstream stages when you work in an unstable environment (or are debugging) and the stage script may fail. In this example, even if the <code>tune</code> stage fails, you can run the <code>train</code> stage using <code>best_params.yaml</code> from the previous run.</li> </ul> <h4 id="train-stage" style="position:relative;">Train Stage<a href="#train-stage" aria-label="train stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p>The Train Stage runs distributed computation via Ray. This stage depends on <code>best_params.yaml</code> generated by the <code>tune</code> stage to access the optimal hyperparameters for training the model. 
The <code>train</code> stage runs the <code>train.py</code> script, which trains the model using the tuned parameters.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/stages/train.py <span class="token punctuation">-</span><span class="token punctuation">-</span>config params.yaml <span class="token key atrule">params</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> train <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> $<span class="token punctuation">{</span>tune.results_dir<span class="token punctuation">}</span>/best_params.yaml <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> $<span class="token punctuation">{</span>train.results_dir<span class="token punctuation">}</span>/model.pth</code></pre></div> <p>The trained model is saved as <code>model.pth</code>, with the path again parameterized to allow flexibility in the output location. 
The output model is automatically cached and versioned with DVC.</p> <h3 id="3---run-dvc-pipelines-on-ray-cluster" style="position:relative;">3 - Run DVC pipelines on Ray Cluster<a href="#3---run-dvc-pipelines-on-ray-cluster" aria-label="3 run dvc pipelines on ray cluster permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To execute your automated and distributed ML pipeline with DVC, perform the following steps:</p> <ul> <li>Set the <code>PYTHONPATH</code> environment variable to the project root so that Python scripts can import modules from your project’s directory.</li> <li>Run the DVC pipeline with the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command.</li> </ul> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">PYTHONPATH</span><span class="token operator">=</span><span class="token environment constant">$PWD</span>
dvc exp run</code></pre></div> <p>This will start the pipeline, running the <code>tune</code> and <code>train</code> stages as defined in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file, utilizing distributed computation with Ray.</p> <p>You can see live updates of metrics and plots in <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a> and <a
href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a>. DVC can generate and render plots based on your project’s data. Metrics and plots logged with DVCLive can be visualized in DVC Studio and DVC Extension for VS Code.</p> <p>A few benefits of tracking and visualizing metrics and plots with DVC (<a href="https://dvc.org/doc/user-guide/experiment-management/visualizing-plots" target="_blank" rel="nofollow noopener noreferrer">see docs</a>):</p> <ul> <li>Enhanced Experiment Tracking: Compare metrics, parameters, version of data, and plots between experiments in a live mode (docs: <a href="https://dvc.org/doc/studio/user-guide/experiments/visualize-and-compare" target="_blank" rel="nofollow noopener noreferrer">Visualize and Compare experiments</a>).</li> <li>Customize Visualization: Define visualization template, select data to be visualized and titles interactively, before or after the experiment is complete (docs: <a href="https://dvc.org/doc/user-guide/experiment-management/visualizing-plots#defining-plots" target="_blank" rel="nofollow noopener noreferrer">Defining plots</a>).</li> <li>Share & Version Control for Metrics: You can send <a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments#live-metrics-and-plots" target="_blank" rel="nofollow noopener noreferrer">live metrics and plots</a> to <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a>, <a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments#push-experiments" target="_blank" rel="nofollow noopener noreferrer">push</a> completed experiments (including data, models, and code), and convert an experiment into a <a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments#persist-experiment" target="_blank" rel="nofollow noopener noreferrer">persistent</a> branch or commit in your Git 
repo (docs <a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments" target="_blank" rel="nofollow noopener noreferrer">Sharing Experiments</a>).</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/979a10864ed18812ec17855a2ec3b7b5/39600/11-experiment-tracking.png" alt="Experiment tracking" title="Experiment tracking" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Experiment tracking with <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a> and <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a></em></p> <blockquote> <p>💡 Note: Sometimes, when you run <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> with a local Ray Cluster, the process may get stuck with <code>Connecting to existing Ray cluster at address: 192.168.100.19:6379...</code> message due to a <code>ConnectionError</code> in Ray. 
In this case, open a new terminal session, export <code>PYTHONPATH</code>, and run the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command there.</p> </blockquote> <h2 id="-discuss-the-solution-design" style="position:relative;">💬 Discuss the Solution Design<a href="#-discuss-the-solution-design" aria-label=" discuss the solution design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The section above walks through a simple example of running DVC and Ray together. It’s not a production setup, but it’s a good starting point for developing and debugging a DVC pipeline with Ray.</p> <p>Let’s review the decisions we made and discuss some details:</p> <ol> <li>Use DVC to run scripts calling the Ray API.</li> <li>Persist DVC stage outputs to keep them available for downstream stages in case of failure.</li> <li>Use DVCLive to track metrics only on a worker with a rank of 0.</li> <li>Propagate DVC environment variables to worker nodes using Ray’s <code>RuntimeEnv</code>.</li> <li>Copy the <code>model.pth</code> file from the Ray Trial folder to the DVC project repository.</li> </ol> <h3 id="️-use-dvc-to-run-scripts-calling-ray-api" style="position:relative;">☝️ Use DVC to run scripts calling Ray API<a href="#%EF%B8%8F-use-dvc-to-run-scripts-calling-ray-api" aria-label="️ use dvc to run scripts calling ray api permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Ray framework provides a rich Python API for distributed data processing, model tuning, and training. Wrapping Ray scripts into callable Python modules simplifies using DVC. Therefore, you get two benefits:</p> <ul> <li>Get scalability and distributed training with Ray</li> <li>Get reproducibility and versioning with DVC</li> </ul> <p>A template of the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> for DVC + Ray:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">first_stage</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python first_script_with_ray.py <span class="token punctuation">...</span> <span class="token key atrule">next_stage</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python second_script_with_ray.py <span class="token punctuation">...</span></code></pre></div> <h3 id="️-persist-dvc-stage-outputs-to-keep-them-available-for-downstream-stages-in-case-of-failure" style="position:relative;">☝️ Persist DVC stage outputs to keep them available for downstream stages in case of failure<a href="#%EF%B8%8F-persist-dvc-stage-outputs-to-keep-them-available-for-downstream-stages-in-case-of-failure" aria-label="️ persist dvc stage outputs to keep them available for downstream stages in case of failure permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" 
d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Set <code>persist: true</code> to instruct DVC not to remove the file before reproducing the stage. This is useful when downstream stages depend on the output and you work in an unstable environment (or are debugging), where the stage script might fail.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">first_stage</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python first_script_with_ray.py
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">stage_output.file</span><span class="token punctuation">:</span>
          <span class="token key atrule">persist</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div> <h3 id="️-use-dvclive-to-track-live-metrics-updates-with-dvc-studio-and-dvc-extension-for-vs-code" style="position:relative;">☝️ <strong>Use DVCLive to track live metrics updates with DVC Studio and <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a></strong><a href="#%EF%B8%8F-use-dvclive-to-track-live-metrics-updates-with-dvc-studio-and-dvc-extension-for-vs-code" aria-label="️ use dvclive to track live metrics updates with dvc studio and dvc extension for vs code permalink" class="anchor after"><svg aria-hidden="true"
height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Ray Train lets you use native experiment tracking libraries inside the <a href="https://docs.ray.io/en/latest/train/overview.html#train-overview-training-function" target="_blank" rel="nofollow noopener noreferrer">train_func</a> function. <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> is a highly flexible and lightweight library that simplifies experiment tracking in DVC projects.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live

<span class="token keyword">with</span> Live<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">as</span> live<span class="token punctuation">:</span>
    live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span>metric_name<span class="token punctuation">,</span> value<span class="token punctuation">)</span></code></pre></div> <p>This solution logs metrics with <code>Live()</code> inside the <code>train_func_per_worker()</code> function.</p> <p>One significant distinction between distributed and non-distributed training lies in the parallel execution of multiple processes in distributed training setups, which may yield identical results under specific configurations.
When all processes communicate results to the tracking backend, there’s a risk of receiving duplicate entries (check <a href="https://docs.ray.io/en/latest/train/user-guides/experiment-tracking.html" target="_blank" rel="nofollow noopener noreferrer">Ray docs</a> for details).</p> <p>Therefore, DVCLive usage needs a few adjustments:</p> <ol> <li>Use DVCLive to track metrics only on a worker with a rank of 0.</li> <li>Use the <code>DVC_ROOT</code> variable to create the <a href="https://dvc.org/doc/dvclive/live/"><code>Live(dir=...)</code></a> object. DVC automatically sets <code>DVC_ROOT</code> to the root directory of your DVC repository, so building the path from it ensures that Ray workers write metrics inside the repo (<a href="https://dvc.org/doc/user-guide/env" target="_blank" rel="nofollow noopener noreferrer">docs</a>).</li> </ol> <p>As a result, the DVCLive usage code inside the <code>train_func_per_worker()</code> function looks like the example below.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># train.py</span>
<span class="token keyword">import</span> os
<span class="token keyword">from</span> typing <span class="token keyword">import</span> Dict

<span class="token keyword">import</span> ray<span class="token punctuation">.</span>train
<span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live


<span class="token keyword">def</span> <span class="token function">train_func_per_worker</span><span class="token punctuation">(</span>config<span class="token punctuation">:</span> Dict<span class="token punctuation">)</span><span class="token punctuation">:</span>

    <span class="token comment"># Initialize DVC Live</span>
    live <span class="token operator">=</span> <span class="token boolean">None</span>
    rank <span class="token operator">=</span> ray<span class="token punctuation">.</span>train<span class="token punctuation">.</span>get_context<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>get_world_rank<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token comment"># Create a Live object on the rank 0 worker</span>
    <span class="token keyword">if</span> rank <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        live <span class="token operator">=</span> Live<span class="token punctuation">(</span>
            <span class="token builtin">dir</span><span class="token operator">=</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span>os<span class="token punctuation">.</span>environ<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"DVC_ROOT"</span><span class="token punctuation">,</span> <span class="token string">""</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"results/dvclive"</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
        <span class="token punctuation">)</span>

    <span class="token keyword">for</span> epoch <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>epochs<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># ...epoch training</span>

        <span class="token comment"># Log metrics with DVCLive (rank 0 worker only)</span>
        <span class="token keyword">if</span> live<span class="token punctuation">:</span>
            live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span><span class="token string">"loss"</span><span class="token punctuation">,</span> test_loss<span class="token punctuation">)</span>
            live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span><span class="token string">"accuracy"</span><span class="token punctuation">,</span> accuracy<span class="token punctuation">)</span>
            live<span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div>
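<p>The rank check and the <code>DVC_ROOT</code> lookup can also be isolated in a small helper, which makes the rule easy to test without starting Ray. The sketch below is only an illustration: the helper name <code>resolve_live_dir</code> and its parameters are hypothetical and not part of DVCLive or Ray, and the default <code>results/dvclive</code> subdirectory is taken from the example above.</p>

```python
import os
from typing import Mapping, Optional


def resolve_live_dir(
    rank: int,
    env: Mapping[str, str],
    subdir: str = "results/dvclive",
) -> Optional[str]:
    """Return the DVCLive output directory for the rank 0 worker, else None.

    Hypothetical helper: only the rank 0 worker gets a directory, and the
    path is built from DVC_ROOT so metrics land inside the DVC repository.
    """
    if rank != 0:
        return None
    # Fall back to a relative path when DVC_ROOT is not set.
    return os.path.join(env.get("DVC_ROOT", ""), subdir)


if __name__ == "__main__":
    # Simulate three workers with a made-up repository path.
    fake_env = {"DVC_ROOT": "/path/to/repo"}
    for worker_rank in range(3):
        print(worker_rank, resolve_live_dir(worker_rank, fake_env))
```

<p>Inside the training function, such a helper could be called as <code>resolve_live_dir(rank, os.environ)</code>, creating <code>Live(dir=...)</code> only when the result is not <code>None</code>.</p>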
<p>Logging metrics and plots with DVCLive in Python code automatically generates the necessary plot configuration in the <code>dvc.yaml</code> file. Below is an example configuration for metrics and plots:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">metrics</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> results/dvclive/metrics.json
<span class="token key atrule">plots</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">accuracy</span><span class="token punctuation">:</span>
      <span class="token key atrule">x</span><span class="token punctuation">:</span> step
      <span class="token key atrule">y</span><span class="token punctuation">:</span>
        <span class="token key atrule">results/dvclive/plots/metrics/accuracy.tsv</span><span class="token punctuation">:</span> accuracy
      <span class="token key atrule">title</span><span class="token punctuation">:</span> Accuracy
      <span class="token key atrule">x_label</span><span class="token punctuation">:</span> Step
      <span class="token key atrule">y_label</span><span class="token punctuation">:</span> Accuracy
  <span class="token punctuation">-</span> <span class="token key atrule">loss</span><span class="token punctuation">:</span>
      <span class="token key atrule">template</span><span class="token punctuation">:</span> simple
      <span class="token key atrule">x</span><span class="token punctuation">:</span> step
      <span class="token key atrule">y</span><span class="token punctuation">:</span>
        <span class="token key atrule">results/dvclive/plots/metrics/loss.tsv</span><span class="token punctuation">:</span> loss
      <span class="token key atrule">title</span><span class="token punctuation">:</span> Loss
      <span class="token key atrule">x_label</span><span class="token punctuation">:</span> Step
      <span class="token key atrule">y_label</span><span class="token punctuation">:</span> Loss
  <span class="token punctuation">-</span> results/tune/plots/images</code></pre></div> <p>The <code>train</code> stage logs metrics and plots to <code>results/dvclive</code>. Data points for metrics and plots are saved in files and visualized later in DVC Studio and VS Code.<br> <span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/65e5240e1e47be1e405116f587cb2b85/39600/10-2-train-metrics.png" alt="Metrics and plot" title="Metrics and plot" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Metrics plot generated by the <code>train</code> stage</em></p> <p>The <code>tune</code> stage logs a <code>mean_accuracy_plot.png</code> file to visualize metrics for tuning trials.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 567px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ca2e4e8b125c2f57b537f0c090594b8d/0a7db/10-tune-metrics.png" alt="Metrics plot" title="Metrics plot" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Metrics plot generated by the <code>tune</code> stage</em></p> <h3 id="️-propagate-dvc-environment-variables-to-worker-nodes" style="position:relative;">☝️ Propagate DVC environment variables to Worker nodes<a href="#%EF%B8%8F-propagate-dvc-environment-variables-to-worker-nodes" aria-label="️ propagate dvc environment variables to worker nodes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9
13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC environment variables are necessary for every Ray worker because they provide essential information and configurations for DVCLive, facilitating experiment tracking. These variables include:</p> <ol> <li><strong>DVC_STUDIO_REPO_URL</strong>: Repository URL where DVC stores versioned data.</li> <li><strong>DVC_STUDIO_TOKEN</strong>: Authentication token for secure access to DVC Studio.</li> <li><strong>DVC_STUDIO_URL</strong>: Web interface URL for managing DVC projects.</li> <li><strong>DVC_EXP_BASELINE_REV</strong>: Baseline revision for comparing experiment results.</li> <li><strong>DVC_EXP_NAME</strong>: Descriptive identifier for the experiment.</li> <li><strong>DVC_ROOT</strong>: Root directory of the DVC project on the filesystem.</li> </ol> <blockquote> <p>💡 Note: All environment variables above are set by DVC automatically when running a pipeline.</p> </blockquote> <p>You don’t need to manage DVC environment variables yourself when running DVC in a non-distributed environment. However, running it in a Ray Cluster requires setting them on every worker.
In this solution, DVC environment variables are passed via <a href="https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html#ray.runtime_env.RuntimeEnv" target="_blank" rel="nofollow noopener noreferrer">RuntimeEnv</a>, which specifies a runtime environment for the whole job.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2a442f0d89f6c564d6a87f715ffcee7b/39600/12-env-vars.png" alt="Set up Environment Variables" title="Set up Environment Variables" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Set up DVC Environment Variables</em></p> <p>The code snippet below demonstrates an approach to managing DVC environment variables within a TorchTrainer setup.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> os

<span class="token keyword">import</span> ray
<span class="token keyword">from</span> ray<span class="token punctuation">.</span>runtime_env <span class="token keyword">import</span> RuntimeEnv
<span class="token keyword">from</span> ray<span class="token punctuation">.</span>train<span class="token punctuation">.</span>torch <span class="token keyword">import</span> TorchTrainer


<span class="token keyword">def</span> <span class="token function">train_func_per_worker</span><span class="token punctuation">(</span>config<span class="token punctuation">:</span> Dict<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment">#...</span>
    <span class="token keyword">if</span> rank <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        live <span class="token operator">=</span> Live<span class="token punctuation">(</span>
            <span class="token builtin">dir</span><span class="token operator">=</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span>os<span class="token punctuation">.</span>environ<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"DVC_ROOT"</span><span class="token punctuation">,</span> <span class="token string">""</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"results/dvclive"</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
        <span class="token punctuation">)</span>


<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>params<span class="token punctuation">:</span> <span class="token builtin">dict</span><span class="token punctuation">)</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token boolean">None</span><span class="token punctuation">:</span>
    <span class="token comment">#...</span>
    trainer <span class="token operator">=</span> TorchTrainer<span class="token punctuation">(</span>
        train_loop_per_worker<span class="token operator">=</span>train_func_per_worker<span class="token punctuation">,</span>
        train_loop_config<span class="token operator">=</span>train_config<span class="token punctuation">,</span>
        <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
    <span class="token punctuation">)</span>


<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    <span class="token comment">#...</span>
    <span class="token comment"># [1] Propagate DVC environment variables from Head Node to Workers</span>
    <span class="token comment"># =============================================</span>
    DVC_ENV_VARS <span class="token operator">=</span> <span class="token punctuation">{</span>k<span class="token punctuation">:</span> v <span class="token keyword">for</span> k<span class="token punctuation">,</span> v <span class="token keyword">in</span> os<span class="token punctuation">.</span>environ<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">if</span> k<span class="token punctuation">.</span>startswith<span class="token punctuation">(</span><span class="token string">"DVC"</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
    ray<span class="token punctuation">.</span>init<span class="token punctuation">(</span>runtime_env<span class="token operator">=</span>RuntimeEnv<span class="token punctuation">(</span>env_vars<span class="token operator">=</span>DVC_ENV_VARS<span class="token punctuation">)</span><span class="token punctuation">)</span>

    train<span class="token punctuation">(</span>params<span class="token punctuation">)</span></code></pre></div> <ul> <li><code>RuntimeEnv</code> propagates the DVC environment variables from the head node to the workers, so they are accessible within the training loop on every worker node.</li> </ul> <h3 id="️-copy-the-modelpth-file-from-the-ray-trial-folder-to-the-dvc-project-repository" style="position:relative;">☝️ Copy the <code>model.pth</code> file from the Ray Trial folder to the DVC project repository<a href="#%EF%B8%8F-copy-the-modelpth-file-from-the-ray-trial-folder-to-the-dvc-project-repository" aria-label="️ copy the modelpth file from the ray trial folder to the dvc project repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Upon completing the training process, the <code>model.pth</code> file is saved in the Ray Trial folder.
Therefore, it’s copied to the DVC project repository (as shown in the code example above).</p> <p>This ensures that the trained model file is appropriately stored within the DVC-managed project structure, facilitating version control and reproducibility.</p> <h2 id="-summing-up-dvc--ray-integration" style="position:relative;">🎨 Summing Up: DVC + Ray Integration<a href="#-summing-up-dvc--ray-integration" aria-label=" summing up dvc ray integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The DVC + Ray integration presents a comprehensive solution to the challenges of running machine learning experiments at scale. By addressing specific issues related to auto-scaling, execution optimization, live metrics tracking, and data synchronization, this setup ensures that machine learning teams can focus on innovation and experimentation backed by a robust, scalable, and efficient infrastructure.</p> <p>In Part 1 of the tutorial, we explored the basics of setting up and integrating DVC with Ray for distributed machine learning workflows. 
We covered the following key topics:</p> <ul> <li><strong>Introduction to Ray</strong>: We discussed Ray’s capabilities for scaling AI and Python applications, focusing on its ability to parallelize and distribute ML workloads across multiple nodes easily.</li> <li><strong>Ray Clusters</strong>: The architecture of Ray clusters was explained, highlighting the roles of head and worker nodes in managing and executing tasks.</li> <li><strong>Ray Tune and Ray Train</strong>: We introduced Ray Tune for hyperparameter optimization and Ray Train for scaling model training code, emphasizing their integration into ML workflows.</li> <li><strong>Local Ray Cluster Setup</strong>: Step-by-step instructions were provided for starting a Ray Cluster locally, showcasing how to test the setup with a simple script.</li> </ul> <h3 id="key-takeaways" style="position:relative;">Key Takeaways<a href="#key-takeaways" aria-label="key takeaways permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The key takeaway from Part 1 is the foundation it sets for scalable and efficient machine learning workflows. By leveraging Ray’s distributed computing capabilities and DVC’s data version control, we establish a robust framework for managing complex ML experiments. 
This combination enhances scalability, reproducibility, and collaboration in ML projects.</p> <h3 id="looking-ahead-to-part-2" style="position:relative;">Looking Ahead to Part 2<a href="#looking-ahead-to-part-2" aria-label="looking ahead to part 2 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In Part 2 of the tutorial, we will extend the solution to a Ray Cluster on AWS, demonstrating how to adapt the setup for cloud-based distributed computing. This will involve configuring AWS resources, deploying Ray clusters in the cloud, and running DVC-managed pipelines at scale. The focus will shift towards managing the increased complexity and leveraging cloud infrastructure to maximize the efficiency and performance of ML experiments.</p> <p>Stay tuned for detailed instructions on deploying and managing cloud-based Ray clusters with DVC as we take the scalability and efficiency of ML workflows to the next level.</p> <blockquote> <p>💡 Did you find this tutorial interesting? Please leave your comments and share your experience with DVC and Ray! 
Join us on <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> 🙌</p> </blockquote> <h2 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a href="https://dvc.org/doc/studio/user-guide/experiments/explore-ml-experiments" target="_blank" rel="nofollow noopener noreferrer">DVC Studio: Explore ML Experiments</a></li> <li><a href="https://docs.ray.io/en/latest/ray-overview/getting-started.html" target="_blank" rel="nofollow noopener noreferrer">Ray docs: Getting Started</a></li> <li><a href="https://www.anyscale.com/blog/ray-common-production-challenges-for-generative-ai-infrastructure" target="_blank" rel="nofollow noopener noreferrer">How Ray solves common production challenges for Generative AI infrastructure</a></li> <li><a href="https://medium.com/samsara-engineering/building-a-modern-machine-learning-platform-with-ray-eb0271f9cbcf" target="_blank" rel="nofollow noopener noreferrer">Building a Modern Machine Learning Platform with Ray</a></li> </ul>https://dvc.org/blog/dvc-slurm-cluster-exscientiahttps://dvc.org/blog/dvc-slurm-cluster-exscientiaMon, 11 Mar 2024 00:00:00 GMT<h2 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>For many ML projects, there comes a point when local development hits the wall and we need to scale up the underlying compute resources. Maybe the dataset grows too large for your primary workstation or the deep learning model requires several high-end GPUs. This should be a routine transition for ML developers, and one to which they shouldn’t have to give much thought. In this blog post, we’ll explain our approach to remote DVC experiments on a SLURM cluster and share some code to get you started.</p> <p>We work at an AI-driven precision medicine company called <a href="https://www.exscientia.ai/" target="_blank" rel="nofollow noopener noreferrer">Exscientia</a>. Our goal is to change the way the world discovers and develops new medicines. The company is roughly evenly split between biologists and chemists on one side and technologists on the other, with your two authors belonging to the latter group; Dom is an AI research scientist and Luis is an engineer. 
This context is important to understand why we gravitated towards DVC in the first place, and why we scaled it up the way we did.</p> <h2 id="why-dvc-on-slurm" style="position:relative;">Why DVC on SLURM?<a href="#why-dvc-on-slurm" aria-label="why dvc on slurm permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As demonstrated in <a href="https://en.wikipedia.org/wiki/Accelerate_(book)" target="_blank" rel="nofollow noopener noreferrer">research undertaken by the DevOps movement</a>, it’s hard to maintain consistent software delivery without well-designed tooling (like CI/CD) and a conducive developer culture (like PRs or working in small batches). Our domain is highly specific, but the same principles apply: to move fast while maintaining high quality, reliability and reproducibility, we need to adopt best DevOps practices. There are only so many hours in a day and you want to spend all of them on trying out new ideas and ideally none on setting up infrastructure. Good tooling optimises scientists’ efficiency and lets them run more experiments, each more thorough and exhaustive than would otherwise have been possible – all this while maintaining control over research code bases which can, if left unchecked, turn into precarious Jenga towers. Predictable code with clear standards also eases collaboration, the lifeblood of science. 
Consequently it’s much more important to pick an arbitrary standard than to obsess over any particular detail.</p> <p>At Exscientia we provide researchers with project templates that automatically set up version control and CI/CD as well as QA tooling like Black, Ruff and Mypy. To coherently extend this setup to the joint realms of data science and ML, we integrated DVC. Our scientists can set up a fresh DVC-enabled repository with all the productivity tooling in just a few keystrokes and start experimenting right away. And because DVC transparently extends Git, there is less tool-induced context switching: users are always dealing with Git in some shape or form, rather than Git (for the code) and a database hidden behind a web service (for all the rest of it). Less context switching translates to less frustration and more flow.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 681px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2bafb54e30355ed5d332a518a1e417b1/39600/high-quality-reliability-reproducibility.png" alt="High quality, reliability, and reproducibility" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>To maintain a frictionless developer experience even as model sizes grow beyond the means of the humble laptop, we surveyed the organisation’s entire computational estate with a view towards designing an effective developer experience. Our platforms must support a number of teams with on-demand Jupyter or RStudio instances as well as workflow orchestration engines. We need to run large unsupervised jobs, interactive analyses and development sessions across many domains and technologies: data processing, ML model training and chemical simulations, each with different resource requirements. 
Finally, submitting a large workload should be a smooth and routine experience.</p> <p>In the end, a cloud-deployed SLURM cluster fit the bill. It can efficiently scale compute resources while maintaining a user-friendly interface for job submission. As a bonus, many of our users are already familiar with SLURM from their past lives in academia. The principal mode of interaction is very simple: the user submits a Bash script describing exactly what they want to happen, including the exact resources required. SLURM will wait until such resources are available and then execute the job as instructed. Thanks to this highly general interface, the same computational resource, and its administrators, can support very diverse groups of users at the same time, reducing infrastructural complexity across the organisation.</p> <h2 id="a-sample-project" style="position:relative;">A sample project<a href="#a-sample-project" aria-label="a sample project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We’ll set up a <a href="https://github.com/Exscientia/rdvc-demo-project" target="_blank" rel="nofollow noopener noreferrer">basic project</a> for this demo and, in keeping with the drug discovery theme, we will predict the solubility of chemical compounds in water using only our recently open-sourced framework, MolFlux.</p> <p>The DVC pipeline consists of a featurisation stage, which loads the “ESOL” dataset of molecules paired with their aqueous solubilities (how easily a molecule dissolves in water), followed by a training stage.</p> <p><span class="gatsby-resp-image-wrapper" 
style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bd47a8d8d16070d3da90c7c38a329d45/39600/stages.png" alt="Stages" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>A few words about molecules and neural networks. Cheminformatics typically represents molecules as graphs, with atoms acting as the nodes and chemical bonds as the edges. There are several ways to feed molecular data to neural networks, each with its own pros and cons. GNNs can act directly on the molecular graph. You can also represent the graph as a string (most commonly using the SMILES format) and feed it to any sequence model such as a transformer.</p> <p>In this example we’ll use a classic cheminformatics transformation called ECFP, or <a href="https://pubs.acs.org/doi/10.1021/ci100050t" target="_blank" rel="nofollow noopener noreferrer">extended connectivity fingerprint</a>. It’s essentially analogous to n-grams in NLP, which track whether a particular sequence of tokens appears in a text document. For example, does the 3-letter sequence “wea” appear in the Wikipedia article on blazers? Indeed it does, as part of “wear”.</p> <p>Returning to ECFPs defined on molecular graphs, each “n-gram” is an atom and its immediate (e.g. 2-hop) neighbourhood. Since the “vocabulary” of all possible “n-grams” is finite, we can associate to each molecule a finite bit-vector (of the same length as the vocabulary) such that the choice of 0 or 1 indicates whether the corresponding “n-gram” is present in the molecule. This bit-vector is the ECFP fingerprint. 
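</p> <p>To make the n-gram analogy concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not ECFP itself: real ECFPs are computed from atom neighbourhoods on the molecular graph, whereas this sketch uses character 3-grams of SMILES strings. The helper names (<code>char_ngrams</code>, <code>build_vocabulary</code>, <code>fingerprint</code>) and the three-molecule corpus are hypothetical and do not come from MolFlux or the demo project.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def char_ngrams(text, n=3):
    """Return the set of all length-n character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_vocabulary(corpus, n=3):
    """Collect every n-gram seen in the corpus; list position = bit index."""
    vocabulary = set()
    for text in corpus:
        vocabulary |= char_ngrams(text, n)
    return sorted(vocabulary)


def fingerprint(text, vocabulary, n=3):
    """Return a bit list with 1 where the vocabulary n-gram occurs in text."""
    present = char_ngrams(text, n)
    return [1 if gram in present else 0 for gram in vocabulary]


# The question from the text: does "wea" occur in "wear"?
print("wea" in char_ngrams("wear"))

# Hypothetical toy corpus of SMILES strings (ethanol, 1-propanol, acetic acid).
corpus = ["CCO", "CCCO", "CC(=O)O"]
vocabulary = build_vocabulary(corpus)
bits = [fingerprint(smiles, vocabulary) for smiles in corpus]
for smiles, fp in zip(corpus, bits):
    print(smiles, fp)</code></pre></div> <p>In this sketch every fingerprint has exactly as many bits as the vocabulary has entries, regardless of how long the input string is.</p> <p>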
Since the fingerprint has a constant length, we can feed it into a wide variety of ML algorithms, such as the MLP in the training stage.</p> <p>We use DVC to configure and run the pipeline, decoupling the data featurisation step (where we convert molecules to ECFPs) from the model training step.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d522f9d2e40d12270de0689ed2a6a0ae/39600/stages-dvcyaml.png" alt="DVC Stage Spec" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>DVC pipelines help organise projects, and because they are versioned in Git, you can reproduce complete workflows and results. Running a new experiment is a command away:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div> <p>This executes and tracks experiments in your repository without polluting it with unnecessary Git commits, branches, directories, etc. For more information and examples, see the <a href="https://dvc.org/doc/command-reference/exp/run" target="_blank" rel="nofollow noopener noreferrer">DVC documentation</a>.</p> <p>It may not be immediately obvious, but our setup is highly modular. Head over to <code>src/rdvc_demo_project/config/main.yaml</code> to see a sample of the configuration options we can tweak for each individual experiment. 
To start a much longer training run, execute</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">model.config.trainer.max_epochs</span><span class="token operator">=</span><span class="token number">100</span></span></code></pre></div> <p>MolFlux was built to be explicitly config-driven and DVC’s <a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">Hydra integration</a> exposes all of that flexibility out of the box.</p> <h2 id="in-the-cloud" style="position:relative;">In the cloud<a href="#in-the-cloud" aria-label="in the cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now that DVC experiments run on our local machine, we’d like to move them to the SLURM cluster. In this second repository, we share the source code to an internal tool we call <a href="https://github.com/exs-dmiketa/rdvc" target="_blank" rel="nofollow noopener noreferrer">rDVC</a> (for <em>remote</em> DVC). It is, by design, a very thin layer around <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> and accepts all of its options and arguments. 
But on top of that it also recognises many of the <a href="https://slurm.schedmd.com/sbatch.html" target="_blank" rel="nofollow noopener noreferrer"><code>sbatch</code> arguments and flags</a>, allowing it to control which computational resource inside the cluster will be used and for how long. For a full list of options, consult <code>rdvc run --help</code>.</p> <p>Let’s demonstrate how it works.</p> <p>On its own, DVC knows nothing about your remote cluster, so we’ll need to start with a small amount of setup. Make sure you have cloned the sample project repo and installed the Python virtual environment using <code>init_python_venv.sh</code>. Initialise your local rDVC config with</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">rdvc</span> init project</span></code></pre></div> <p>Follow the wizard to set up default options for this project’s remote runs; they are stored in <code>.rdvc/config.toml</code> inside the project repository. Depending on the cluster’s setup, you may be able to choose the <em>instance type</em> allocated to your job. For the demo we have configured the cluster with t3.xlarge, g5.xlarge and g5.12xlarge. Our internal version of rDVC supports many more instance types and we encourage you to fork rDVC, redefine the supported instance types and make the tool your own. For this demo, we pick g5.xlarge as the default instance because we want access to a GPU. 
To point rDVC at your SLURM cluster, we’ll run the global initialisation script next:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">rdvc</span> init global</span></code></pre></div> <p>rDVC now knows how to contact SLURM, so let’s finish with configuration of the remote server:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">rdvc</span> init remote</span></code></pre></div> <p>Nothing stands between us and a remote GPU-powered experiment! Since rDVC is in many ways just a wrapper around <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>, we can easily set off a run with modified parameters as</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">rdvc</span> run <span class="token parameter variable">-S</span> <span class="token assign-left variable">fabric</span><span class="token operator">=</span>gpu</span></code></pre></div> <p>When your run is finished you can pull it to your local repository with</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp pull</span> origin</span></code></pre></div> <p>and look at the results.</p> <h2 id="behind-the-scenes" style="position:relative;">Behind the scenes<a href="#behind-the-scenes" aria-label="behind the scenes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 
1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>rDVC compiled a SLURM batch (or “sbatch”) script containing these instructions:</p> <ol> <li>Clone the project repo</li> </ol> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token shebang important">#!/bin/bash</span>
<span class="token comment">#SBATCH --output=".rdvc/logs/slurm-%j.out"</span>
<span class="token comment">#SBATCH --job-name=rdvc-run:rdvc-demo-project:main</span>
<span class="token comment">#SBATCH --wckey=rdvc-demo-project</span>
<span class="token comment">#SBATCH --mail-type=END,FAIL</span>
<span class="token comment">#SBATCH --mail-user=<[email protected]></span>
<span class="token comment">#SBATCH --constraint=t3.xlarge</span>
<span class="token comment">#SBATCH --cpus-per-task=2</span>
<span class="token comment">#SBATCH --nodes=1</span>
<span class="token comment">#SBATCH --exclusive</span>

<span class="token comment"># Ensure bashrc is loaded</span>
<span class="token builtin class-name">source</span> <span class="token string">"<span class="token variable">${<span class="token environment constant">HOME</span>}</span>/.bashrc"</span>

<span class="token comment"># Exit on failure http://redsymbol.net/articles/unofficial-bash-strict-mode/</span>
<span class="token builtin class-name">set</span> <span class="token parameter variable">-euxo</span> pipefail
<span class="token assign-left variable"><span class="token environment constant">IFS</span></span><span class="token operator">=</span><span class="token string">$'<span class="token entity" title="\n">\n</span><span class="token entity" title="\t">\t</span>'</span>

<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_NAME</span><span class="token operator">=</span><span class="token string">"rdvc-demo-project"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_URL</span><span class="token operator">=</span><span class="token string">"[email protected]:<user>/rdvc-demo-project.git"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_BRANCH</span><span class="token operator">=</span><span class="token string">"main"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_REV</span><span class="token operator">=</span><span class="token string">"<git_hash>"</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_DIR</span><span class="token operator">=</span><span class="token string">"<span class="token variable">${RDVC_DIR<span class="token operator">:-</span>${<span class="token environment constant">HOME</span>}</span>/.rdvc}"</span>

<span class="token comment"># Prepare a directory for the current job</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_WORKSPACE_DIR</span><span class="token operator">=</span><span class="token string">"/tmp/rdvc-<span class="token variable">${SLURM_JOB_ID}</span>"</span>
<span class="token function">mkdir</span> <span class="token parameter variable">-p</span> <span class="token string">"<span class="token variable">${RDVC_JOB_WORKSPACE_DIR}</span>"</span>

<span class="token comment"># Ensure cleanup after job finishes, regardless of exit status</span>
<span class="token keyword">function</span> <span class="token function-name function">cleanup_job_dir</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
  <span class="token builtin class-name">echo</span> <span class="token string">"Cleaning up the job directory."</span>
  <span class="token function">rm</span> <span class="token parameter variable">-rf</span> <span class="token string">"<span class="token variable">${RDVC_JOB_WORKSPACE_DIR}</span>"</span>
<span class="token punctuation">}</span>
<span class="token builtin class-name">trap</span> cleanup_job_dir EXIT

<span class="token comment"># Create an insulated Git workspace for the current job</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Creating Git workspace."</span>
<span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_REPO_DIR</span><span class="token operator">=</span><span class="token string">"<span class="token variable">${RDVC_JOB_WORKSPACE_DIR}</span>/<span class="token variable">${RDVC_JOB_REPO_NAME}</span>"</span>
<span class="token function">git</span> clone <span class="token parameter variable">--branch</span> <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_BRANCH}</span>"</span> <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_URL}</span>"</span> <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_DIR}</span>"</span>
<span class="token builtin class-name">cd</span> <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_DIR}</span>"</span> <span class="token operator">||</span> <span class="token builtin class-name">exit</span>

<span class="token comment"># Ensure the job runs on the same revision as was submitted (even if the branch has moved on in the meantime)</span>
<span class="token function">git</span> checkout <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_REV}</span>"</span></code></pre></div> <ol start="2"> <li>Install the Python virtual environment with <code>init_python_venv.sh</code></li> </ol> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Install Python environment</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Install Python environment."</span>
./init_python_venv.sh
<span class="token builtin class-name">echo</span> <span class="token string">"Activate Python environment."</span>
<span class="token builtin class-name">source</span> ./.venv/bin/activate

<span class="token comment"># Set up links for the DVC cache shared among jobs and projects</span>
dvc config <span class="token parameter variable">--local</span> cache.type hardlink,symlink,copy

<span class="token comment"># Push results of experiments even if job fails</span>
<span class="token keyword">function</span> <span class="token function-name function">cleanup_dvc</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
  <span class="token keyword">if</span> <span class="token punctuation">[</span> <span class="token string">"<span class="token variable">$1</span>"</span> <span class="token operator">!=</span> <span class="token string">"0"</span> <span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token keyword">then</span>
    <span class="token comment"># Push cache of all runs, including failed</span>
    <span class="token builtin class-name">echo</span> <span class="token string">"Job failed. Pushing run cache."</span>
    dvc push --run-cache
  <span class="token keyword">else</span>
    <span class="token builtin class-name">echo</span> <span class="token string">"Job successfully finished."</span>
  <span class="token keyword">fi</span>
  deactivate
  cleanup_job_dir
<span class="token punctuation">}</span>
<span class="token builtin class-name">trap</span> <span class="token string">'cleanup_dvc $?'</span> EXIT</code></pre></div> <ol start="3"> <li>Execute <code>dvc exp run -S fabric=gpu</code></li> </ol> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">RDVC_JOB_EXP_RUN_OPTIONS_STRING</span><span class="token operator">=</span><span class="token string">"-S fabric=gpu"</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Executing DVC experiment."</span>
<span class="token builtin class-name">eval</span> <span class="token string">"dvc exp run --pull --allow-missing <span class="token variable">${RDVC_JOB_EXP_RUN_OPTIONS_STRING}</span>"</span></code></pre></div> <ol start="4"> <li>Push the experiment to the remote</li> </ol> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Push experiment to the remote and update the repository</span>
<span class="token builtin class-name">echo</span> <span class="token string">"Pushing DVC experiment to Git and DVC remotes."</span>
dvc exp push <span class="token string">"<span class="token variable">${RDVC_JOB_REPO_URL}</span>"</span></code></pre></div> <p>This script is submitted to the cluster over SSH. You can view it in <code>~/.rdvc/submissions</code>.</p> <p>And that’s it! It’s so simple you could do it manually in an interactive SLURM session - and that happens to be a good way to debug issues. 
If your job fails, first consult its log in <code>~/.rdvc/logs</code> and then try to reproduce the failure by running the steps of the submission script from an interactive session.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We shared two repositories: a simple DVC project and a tool for remote execution on SLURM clusters. The latter is universal - it knows nothing about the project! - and easily hackable. We highly recommend forking it and customising it to your team’s needs.</p>https://dvc.org/blog/automate-data-validation-and-model-monitoring-with-evidently-and-dvchttps://dvc.org/blog/automate-data-validation-and-model-monitoring-with-evidently-and-dvcFri, 19 Jan 2024 00:00:00 GMT<p><em>Feel free to clone the repository provided. 
It's more than a learning tool; it's a flexible reference architecture that you can adapt to fit your unique use cases.</em></p> <h2 id="why-dvc-and-evidently" style="position:relative;">Why DVC and Evidently?<a href="#why-dvc-and-evidently" aria-label="why dvc and evidently permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In the realm of Machine Learning Operations (MLOps), ensuring the robustness and reliability of models is paramount. Using the right tools can significantly enhance your MLOps practices.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1158f2e1f91f438b80df0406fe0c1aaf/39600/2-mlops-workflow.png" alt="Typical Machine Learning Operations (MLOps) workflow" title="Typical Machine Learning Operations (MLOps) workflow" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Typical Machine Learning Operations (MLOps) workflow</em></p> <p><strong><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a></strong> is an open-source tool that brings agility and reproducibility to data science projects by treating data and model training pipelines as software. 
It connects versioned data sources and code with pipelines, tracks experiments, and registers models, all based on GitOps principles.</p> <p><strong><a href="https://github.com/evidentlyai/evidently" target="_blank" rel="nofollow noopener noreferrer">Evidently</a></strong> is an open-source Python library to evaluate, test, and <a href="https://www.evidentlyai.com/ml-in-production/model-monitoring" target="_blank" rel="nofollow noopener noreferrer">monitor ML models</a>. It has 100+ built-in metrics and tests on data quality, data drift, and model performance, and helps you visualize them interactively.</p> <p>When used together, DVC and Evidently offer a comprehensive solution for training, prediction, and monitoring of ML models.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3300f3ab904f1e6f56d45ce0fc52a3d7/39600/3-dvc-evidently-features.png" alt="Core features of DVC and Evidently for MLOps practices" title="Core features of DVC and Evidently for MLOps practices" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Core features of DVC and Evidently for MLOps practices</em></p> <blockquote> <p>💡 <strong>Want to learn more about DVC and Evidently?</strong></p> <ul> <li><a href="https://learn.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Iterative Tools for Data Scientists & Analysts course</a> with DVC</li> <li><a href="https://www.evidentlyai.com/ml-observability-course" target="_blank" rel="nofollow noopener noreferrer">Open-source ML observability course</a> with Evidently</li> </ul> </blockquote> <h2 id="tutorial-scope" style="position:relative;">Tutorial scope<a href="#tutorial-scope" aria-label="tutorial scope permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 
3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This tutorial teaches you how to build DVC pipelines for training and monitoring jobs, parse Evidently reports, and version reference datasets.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6983ae2db58b2ee1742db44917f93659/39600/4-example-pipelines.png" alt="Pipelines and artifacts of the example project" title="Pipelines and artifacts of the example project" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Pipelines and artifacts of the example project</em></p> <p>By the end of this tutorial, you will know how to implement an ML monitoring architecture using:</p> <ul> <li><a href="https://www.evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">Evidently</a> to perform data quality, data drift, and model quality checks.</li> <li><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> to run monitoring jobs and version monitoring artifacts.</li> <li><a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> to save monitoring metrics from Python scripts and visualize them in VS Code.</li> </ul> <p>You can run the example on a local machine in a Python virtual environment.</p> <h3 id="dataset-sales-forecasting" style="position:relative;">Dataset: Sales Forecasting<a href="#dataset-sales-forecasting" aria-label="dataset sales forecasting permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 
0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><strong>Dataset.</strong> You will be diving into a <a href="https://www.kaggle.com/c/bike-sharing-demand/data" target="_blank" rel="nofollow noopener noreferrer">Kaggle dataset</a> focused on Bike Sharing Demand. The goal is to predict hourly bike rental volumes.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f457eae2892f8cf155a481f3167c5c11/39600/5-tutorial-1-model-analytics-in-production.png" alt="Source: https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production" title="Source: https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Source: <a href="https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production" target="_blank" rel="nofollow noopener noreferrer">https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production</a></em></p> <p><strong>ML Application.</strong> Use historical usage and weather data to predict bike rental demand. 
Accurate demand forecasts are essential for operational efficiency and customer service.</p> <p>Similar applications:</p> <ul> <li>Demand prediction applies in sectors like retail, transportation, and energy.</li> <li>Monitoring ensures models stay relevant and effective despite changing data patterns.</li> </ul> <h3 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We expect that you:</p> <ul> <li>Have learned the basics of DVC by following the <a href="https://dvc.org/doc/start#get-started-with-dvc" target="_blank" rel="nofollow noopener noreferrer">Get Started with DVC</a> guide.</li> <li>Have gone through the Evidently <a href="https://docs.evidentlyai.com/get-started/tutorial/?utm_source=website&utm_medium=referral&utm_campaign=blog_text&utm_content=batch-ml-monitoring-architecture" target="_blank" rel="nofollow noopener noreferrer">Get Started Tutorial</a> and can generate visual and JSON Reports with Metrics.</li> </ul> <p>To follow this tutorial, you'll need the following tools installed on your local machine:</p> <ul> <li>Python version 3.11 or above</li> <li>Git</li> <li>VS Code and <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a></li> </ul> <blockquote> <p>💡 Note: we tested this example on macOS/Linux.</p> </blockquote> <h2 id="-installation" style="position:relative;">👩‍💻 Installation<a href="#-installation" aria-label=" installation permalink" class="anchor after"><svg 
aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>First, install the pre-built example. Check the repository's README file for more technical details and notes.</p> <p><strong>1. Fork / Clone this repository</strong></p> <p>Clone the GitHub repository with the example code. This repository provides the necessary files and scripts for setting up the integration between Evidently and DVC.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">git</span> clone https://github.com/iterative/evidently-dvc.git
$ <span class="token builtin class-name">cd</span> evidently-dvc</code></pre></div> <p><strong>2. 
Install Python dependencies</strong></p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ python3 <span class="token parameter variable">-m</span> venv .venv
$ <span class="token builtin class-name">echo</span> <span class="token string">"export PYTHONPATH=<span class="token environment constant">$PWD</span>"</span> <span class="token operator">>></span> .venv/bin/activate
$ <span class="token builtin class-name">source</span> .venv/bin/activate
$ pip <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements.txt</code></pre></div> <blockquote> <p>💡 Note: Run all the code examples below within the activated virtual environment.</p> </blockquote> <h2 id="-run-ml-monitoring-example" style="position:relative;">🚀 Run ML monitoring example<a href="#-run-ml-monitoring-example" aria-label=" run ml monitoring example permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now, let’s launch the pre-built example to run monitoring pipelines and manage monitoring artifacts using DVC and Evidently.</p> <h3 id="1-running-the-train-pipeline" style="position:relative;">1. 
Running the <code>train</code> pipeline<a href="#1-running-the-train-pipeline" aria-label="1 running the train pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To run the entire pipeline, execute a simple command in your terminal. Make sure you're in the project's root directory:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc exp run pipelines/train/dvc.yaml</code></pre></div> <p>This command runs the stages defined in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file located in <code>pipelines/train</code>. DVC experiments allow you to track changes made during each run, making it easier to iterate and improve your model. 
Here’s what happens in each stage:</p> <ul> <li><strong>load_data</strong>: <ul> <li>Downloads and unzips the dataset into your <code>data/</code> directory.</li> </ul> </li> <li><strong>extract_data</strong>: <ul> <li>Executes <code>src/stages/extract_data.py</code>, using parameters from <code>pipelines/train/params.yaml</code>.</li> <li>Outputs training and testing datasets to the specified paths.</li> </ul> </li> <li><strong>train</strong>: <ul> <li>Runs <code>train.py</code>, training the model with the training data.</li> <li>Saves the model to <code>models/model.joblib</code>.</li> </ul> </li> <li><strong>evaluate</strong>: <ul> <li>Runs <code>evaluate.py</code> to assess the model on the test data.</li> <li>Outputs reference data for monitoring to <code>data/reference_data.csv</code>.</li> <li>Builds the model performance report using Evidently Regression Preset and saves it to <code>reports/train/model_performance.html</code>.</li> <li>Saves metrics to <code>reports/train/metrics.json</code>.</li> </ul> </li> </ul> <p>After the pipeline is complete, you can:</p> <ul> <li>(1) visualize training metrics in the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for Visual Studio Code</a>,</li> <li>(2) open the detailed model performance HTML report built with Evidently in the browser.</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/47bdc4d5ee7cfd053bf17a9debffbdf8/39600/6-metrics-and-reports.png" alt="Metrics and reports for Training pipeline" title="Metrics and reports for Training pipeline" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Metrics and reports for Training pipeline</em></p> <blockquote> <p>💡 Note: Make sure you have the <a 
href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for Visual Studio Code</a> installed.</p> </blockquote> <h3 id="2-running-the-predict-pipeline" style="position:relative;">2. Running the <code>predict</code> pipeline<a href="#2-running-the-predict-pipeline" aria-label="2 running the predict pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Once your model is trained and evaluated, the next step is to generate predictions on new data. To run the pipeline, execute the following command in your terminal:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc repro pipelines/predict/dvc.yaml</code></pre></div> <p>Here’s what happens in each stage:</p> <ul> <li><strong>predict</strong>: <ul> <li>Executes <code>src/stages/predict.py</code>, using parameters from <code>pipelines/predict/params.yaml</code>.</li> <li>Saves predictions to a CSV file named <code>data/predictions/${predict.week_start}--${predict.week_end}.csv</code>. The <code>week_start</code> and <code>week_end</code> parameters are defined in the corresponding <code>params.yaml</code> file.</li> </ul> </li> </ul> <p>DVC automatically starts tracking versions of the saved CSV file. 
You can now push the data to cloud remote storage.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/19ac13afb82e9d61ad06fe7726b3573f/39600/7-artifacts-versioned-with-dvc.png" alt="Managing prediction datasets with DVC" title="Managing prediction datasets with DVC" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Managing prediction datasets with DVC</em></p> <blockquote> <p>💡 Note: You can find more data management scenarios in the <a href="https://dvc.org/doc/user-guide/data-management/remote-storage" target="_blank" rel="nofollow noopener noreferrer">Data Management with DVC</a> docs.</p> </blockquote> <h3 id="3-run-monitor-pipeline" style="position:relative;">3. Run <code>monitor</code> pipeline<a href="#3-run-monitor-pipeline" aria-label="3 run monitor pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The monitor pipeline consists of two key stages: <code>monitor_data</code> and <code>monitor_model</code>. 
These stages are crucial for ensuring your machine learning models' ongoing health and performance.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc repro pipelines/monitor/dvc.yaml</code></pre></div> <p>Here’s what happens in each stage:</p> <ul> <li><strong>monitor_data:</strong> <ul> <li>This stage is responsible for monitoring data quality and detecting data drift.</li> <li>Executes <code>src/stages/monitor_data.py</code> with configuration parameters from <code>pipelines/monitor/params.yaml</code>.</li> <li>Produces HTML reports for data drift and data quality, and stores them in a directory named <code>reports/${predict.week_start}--${predict.week_end}</code>.</li> </ul> </li> <li><strong>monitor_model:</strong> <ul> <li>Focuses on monitoring the performance of the model and detecting target drift.</li> <li>Executes <code>src/stages/monitor_model.py</code> with configuration parameters from <code>pipelines/monitor/params.yaml</code>.</li> <li>Generates HTML reports for model performance and target drift, saved in the monitoring reports directory named <code>reports/${predict.week_start}--${predict.week_end}</code>.</li> </ul> </li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/836c14f61a048464283c543c2f53370a/39600/8-evidently-reports.png" alt="Model Performance and Data Validation reports" title="Model Performance and Data Validation reports" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Model Performance and Data Validation reports</em></p> <h2 id="-data-validation-and-model-monitoring-with-evidently" style="position:relative;">📈 Data Validation and Model Monitoring with Evidently<a href="#-data-validation-and-model-monitoring-with-evidently" aria-label=" data 
validation and model monitoring with evidently permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now, let’s explore how Evidently works internally as a part of an ML model monitoring architecture.</p> <h3 id="metrics-and-reports" style="position:relative;">Metrics and Reports<a href="#metrics-and-reports" aria-label="metrics and reports permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The idea behind Evidently is very simple: it calculates a bunch of metrics and organizes them into nice reports. Reports are the most effective way to analyze and debug your models and data visually. You may save reports as HTML files, JSON snapshots, or export the metrics externally by parsing JSON or Python dictionary outputs. 
This allows you to apply Evidently for multiple validation and monitoring scenarios in <a href="https://evidentlyai.com/blog/fastapi-tutorial" target="_blank" rel="nofollow noopener noreferrer">real-time</a> and <a href="https://www.evidentlyai.com/blog/batch-ml-monitoring-architecture" target="_blank" rel="nofollow noopener noreferrer">batch-scoring</a> ML applications:</p> <ul> <li>save monitoring reports in HTML files and use them to analyze and debug your models and data,</li> <li>get values for specific metrics, and log them to external databases (like PostgreSQL) and dashboarding tools (like Grafana),</li> <li>save monitoring reports (as snapshots) in JSON files over time and run an <a href="https://docs.evidentlyai.com/user-guide/monitoring/monitoring_overview" target="_blank" rel="nofollow noopener noreferrer">Evidently Monitoring Dashboard</a> for continuous monitoring.</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7428bfdbfbd449b82ab6e881b38f4505/39600/9-evidently.png" alt="Source: https://docs.evidentlyai.com/ " title="Source: https://docs.evidentlyai.com/ " loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Source: <a href="https://docs.evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">https://docs.evidentlyai.com/</a></em></p> <p>If you choose to use HTML and JSON files, you need a way to store and version them. 
In the following section of the tutorial, we will explore how DVC can assist with this.</p> <h3 id="data-requirements" style="position:relative;">Data Requirements<a href="#data-requirements" aria-label="data requirements permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To calculate metrics and build monitoring reports with Evidently, you typically need <strong>two datasets</strong>:</p> <ul> <li><strong>Reference</strong> dataset is a baseline for comparison or an exemplary dataset that helps generate test conditions. This can be training data or earlier production data. (from <a href="https://docs.evidentlyai.com/user-guide/input-data/data-requirements" target="_blank" rel="nofollow noopener noreferrer">docs</a>)</li> <li><strong>Current</strong> dataset is the dataset you want to evaluate. It can include the most recent production data. 
(from <a href="https://docs.evidentlyai.com/user-guide/input-data/data-requirements" target="_blank" rel="nofollow noopener noreferrer">docs</a>)</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c980fb2bf7a2d0daf0948a146024cb93/39600/10-evidently-datasets.png" alt="Original image: https://docs.evidentlyai.com/user-guide/input-data/data-requirements " title="Original image: https://docs.evidentlyai.com/user-guide/input-data/data-requirements " loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Original image: <a href="https://docs.evidentlyai.com/user-guide/input-data/data-requirements" target="_blank" rel="nofollow noopener noreferrer">https://docs.evidentlyai.com/user-guide/input-data/data-requirements</a></em></p> <p>In this tutorial, the reference dataset is a sample extracted from the training dataset. 
This makes it possible to generate the reference automatically during training and keeps the versions of the reference dataset and the model aligned.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># src/stages/evaluate.py</span>
reference_data <span class="token operator">=</span> train_data<span class="token punctuation">.</span>sample<span class="token punctuation">(</span>frac<span class="token operator">=</span><span class="token number">0.3</span><span class="token punctuation">)</span></code></pre></div> <h2 id="-automate-data-and-monitoring-pipelines-with-dvc" style="position:relative;">📈 Automate Data and Monitoring Pipelines with DVC<a href="#-automate-data-and-monitoring-pipelines-with-dvc" aria-label=" automate data and monitoring pipelines with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This section will guide you through the design and implementation of monitoring pipelines, providing insights for further improvements and customization.</p> <h3 id="separate-dvc-pipelines" style="position:relative;">Separate DVC pipelines<a href="#separate-dvc-pipelines" aria-label="separate dvc pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 
1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In the tutorial example, we aimed to follow these ML system design principles:</p> <ul> <li><strong>Modular Design</strong>: Each stage of the ML workflow, such as data preparation, model training, and monitoring, is encapsulated in separate DVC pipelines. This modular approach enhances maintainability and scalability.</li> <li><strong>Pipeline Independence</strong>: These pipelines can be run independently, which allows for flexibility in execution and troubleshooting. In a typical scenario, training, inference, and monitoring pipelines run independently at different time intervals and environments.</li> <li><strong>Reusability</strong>: By separating the pipelines, you can easily reuse components across different projects or stages of the same project.</li> </ul> <p>As a result, the tutorial example has three pipelines for training, prediction, and monitoring. DVC allows you to have multiple <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files to configure and run pipelines.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7ae9f4cbd0b609a6d4c7d29cd0eb12cf/39600/10-pipelines-dir.png" alt="Pipelines Directory Structure" title="Pipelines Directory Structure" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Pipelines Directory Structure</em></p> <p>Let’s explore an excerpt from the <code>pipelines/monitor/dvc.yaml</code> to discuss a few “advanced” configuration features you may find useful:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">vars</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">PIPELINE_DIR</span><span class="token punctuation">:</span> pipelines/monitor

<span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">monitor_data</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/stages/monitor_data.py <span class="token punctuation">-</span><span class="token punctuation">-</span>config=$<span class="token punctuation">{</span>PIPELINE_DIR<span class="token punctuation">}</span>/params.yaml
    <span class="token key atrule">wdir</span><span class="token punctuation">:</span> ../..
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>PIPELINE_DIR<span class="token punctuation">}</span>/params.yaml<span class="token punctuation">:</span>
          <span class="token punctuation">-</span> predict
          <span class="token punctuation">-</span> monitoring
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> src/stages/monitor_data.py
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>predict.predictions_dir<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>predict.week_start<span class="token punctuation">}</span><span class="token punctuation">-</span><span class="token punctuation">-</span>$<span class="token punctuation">{</span>predict.week_end<span class="token punctuation">}</span>.csv
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>monitoring.reports_dir<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>predict.week_start<span class="token punctuation">}</span><span class="token punctuation">-</span><span class="token punctuation">-</span>$<span class="token punctuation">{</span>predict.week_end<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>monitoring.data_drift_path<span class="token punctuation">}</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>monitoring.reports_dir<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>predict.week_start<span class="token punctuation">}</span><span class="token punctuation">-</span><span class="token punctuation">-</span>$<span class="token punctuation">{</span>predict.week_end<span class="token punctuation">}</span>/$<span class="token punctuation">{</span>monitoring.data_quality_path<span class="token punctuation">}</span></code></pre></div> <ul> <li>☝️ <strong>Using <code>vars</code>:</strong> <ul> <li>Variables (<code>vars</code>) in DVC define values that can be reused across the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. It makes complex <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files more readable and easier to update.</li> <li>In this example, <code>PIPELINE_DIR</code> is used to specify the pipeline directory in the project repository. You may reference this variable using the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#templating" target="_blank" rel="nofollow noopener noreferrer">templating</a> format to insert values like <code>${PIPELINE_DIR}</code>.</li> </ul> </li> <li>☝️ <strong>Using <code>wdir</code>:</strong> <ul> <li>The <code>wdir</code> (working directory) key in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> sets the directory context for running the commands defined in a stage. 
It allows you to use relative paths for dependencies (<code>deps</code>), outputs (<code>outs</code>), and scripts within that directory.</li> <li>In this example, <code>wdir: ../..</code> points to the repository root, so paths in <code>deps</code> and <code>outs</code> are easier to read and maintain.</li> </ul> </li> <li>☝️ <strong>Using separate <code>params.yaml</code>:</strong> <ul> <li>The <code>params.yaml</code> file holds parameters, and DVC allows a project to have several of them.</li> <li>This example has a separate <code>params.yaml</code> file for each pipeline. To tell DVC which file to use, we specify the full path to the <code>params.yaml</code> using the <code>PIPELINE_DIR</code> variable.</li> </ul> </li> </ul> <h3 id="storing-monitoring-configuration-in-paramsyaml" style="position:relative;">Storing monitoring configuration in <code>params.yaml</code><a href="#storing-monitoring-configuration-in-paramsyaml" aria-label="storing monitoring configuration in paramsyaml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In some monitoring scenarios, you may have parameterized pipelines. With DVC, you can reuse the <code>params.yaml</code> file to configure the monitoring pipeline. 
This brings a few benefits:</p> <ul> <li><strong>Ease of Modification</strong>: You can quickly adjust the pipeline's behavior by modifying the parameters in this file, such as changing the data source or tuning model parameters.</li> <li><strong>Version Control for Parameters</strong>: Since <code>params.yaml</code> is under Git version control, changes in configurations are tracked by Git, ensuring reproducibility and transparency in your pipeline's evolution.</li> </ul> <p>Let’s explore <code>pipelines/monitor/params.yaml</code>:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">---</span>
<span class="token key atrule">data</span><span class="token punctuation">:</span>
  <span class="token key atrule">predict_data</span><span class="token punctuation">:</span> data/test.csv
  <span class="token key atrule">target_col</span><span class="token punctuation">:</span> cnt
  <span class="token key atrule">prediction_col</span><span class="token punctuation">:</span> prediction
  <span class="token key atrule">numerical_features</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">'temp'</span><span class="token punctuation">,</span> <span class="token string">'atemp'</span><span class="token punctuation">,</span> <span class="token string">'hum'</span><span class="token punctuation">,</span> <span class="token string">'windspeed'</span><span class="token punctuation">,</span> <span class="token string">'hr'</span><span class="token punctuation">,</span> <span class="token string">'weekday'</span><span class="token punctuation">]</span>
  <span class="token key atrule">categorical_features</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token string">'season'</span><span class="token punctuation">,</span> <span class="token string">'holiday'</span><span class="token punctuation">,</span> <span class="token string">'workingday'</span><span class="token punctuation">]</span>

<span class="token key atrule">predict</span><span class="token punctuation">:</span>
  <span class="token key atrule">model_path</span><span class="token punctuation">:</span> models/model.joblib
  <span class="token key atrule">week_start</span><span class="token punctuation">:</span> <span class="token string">'2011-01-29'</span>
  <span class="token key atrule">week_end</span><span class="token punctuation">:</span> <span class="token string">'2011-02-04'</span>
  <span class="token key atrule">predictions_dir</span><span class="token punctuation">:</span> data/predictions

<span class="token key atrule">monitoring</span><span class="token punctuation">:</span>
  <span class="token key atrule">reports_dir</span><span class="token punctuation">:</span> reports
  <span class="token key atrule">reference_data</span><span class="token punctuation">:</span> data/reference_data.csv

  <span class="token comment"># for monitor_model</span>
  <span class="token key atrule">model_performance_path</span><span class="token punctuation">:</span> model_performance.html
  <span class="token key atrule">target_drift_path</span><span class="token punctuation">:</span> target_drift.html

  <span class="token comment"># for monitor_data</span>
  <span class="token key atrule">data_drift_path</span><span class="token punctuation">:</span> data_drift.html
  <span class="token key atrule">data_quality_path</span><span class="token punctuation">:</span> data_quality.html</code></pre></div> <ul> <li>☝️ <strong>List features to be included in monitoring reports:</strong> <ul> <li><code>target_col</code> and <code>prediction_col</code> define the names of the target and prediction columns.</li> <li><code>numerical_features</code> and <code>categorical_features</code> define feature names for monitoring purposes. 
This could be especially beneficial for data monitoring and data drift reports.</li> </ul> </li> <li>☝️ <strong>Parametrized data samples:</strong> <ul> <li><code>week_start</code> and <code>week_end</code> define the time frame for which predictions are generated. This example can be modified to support other approaches for data extraction.</li> </ul> </li> <li>☝️ <strong>Specify a reference dataset:</strong> <ul> <li><code>reference_data</code> specifies a path to the reference dataset used in monitoring.</li> <li>You may have multiple reference datasets and select among them to generate reports.</li> </ul> </li> <li>☝️ <strong>Specify the location to store monitoring artifacts:</strong> <ul> <li><code>monitoring</code> section also specifies the location for monitoring reports.</li> <li>You may update the reports directory or filenames in a single place. It’s handy!</li> </ul> </li> </ul> <h3 id="log-monitoring-metrics-with-dvclive-and-visualize-them-in-vs-code-ide" style="position:relative;">Log monitoring metrics with DVCLive and visualize them in VS Code IDE<a href="#log-monitoring-metrics-with-dvclive-and-visualize-them-in-vs-code-ide" aria-label="log monitoring metrics with dvclive and visualize them in vs code ide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> provides a Python API to log metrics, plots, models, and other artifacts from code. 
Metrics and plots saved with DVCLive can be automatically visualized in DVC extension for VS Code.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f2ad21f7263dfbc0cf88d0bc7c9a2b90/39600/11-metrics-vscode.png" alt="Metrics in DVC Extension for VS Code" title="Metrics in DVC Extension for VS Code" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Metrics in DVC Extension for VS Code</em></p> <p>Let’s explore an example of the <code>src/stages/evaluate.py</code> script to demonstrate how DVCLive can help in DVC projects.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> <span class="token comment"># Build a report</span> model_performance_report <span class="token operator">=</span> Report<span class="token punctuation">(</span>metrics<span class="token operator">=</span><span class="token punctuation">[</span>RegressionPreset<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span> model_performance_report<span class="token punctuation">.</span>run<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span> <span class="token comment"># Extract metrics</span> regression_metrics<span class="token punctuation">:</span> Dict <span class="token operator">=</span> model_performance_report<span class="token punctuation">.</span>as_dict<span class="token 
punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">[</span><span class="token string">'metrics'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'result'</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"current"</span><span class="token punctuation">]</span> metric_names <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'r2_score'</span><span class="token punctuation">,</span> <span class="token string">'rmse'</span><span class="token punctuation">,</span> <span class="token string">'mean_error'</span><span class="token punctuation">,</span> <span class="token string">'mean_abs_error'</span><span class="token punctuation">,</span> <span class="token string">'mean_abs_perc_error'</span><span class="token punctuation">]</span> selected_metrics <span class="token operator">=</span> <span class="token punctuation">{</span>k<span class="token punctuation">:</span> regression_metrics<span class="token punctuation">.</span>get<span class="token punctuation">(</span>k<span class="token punctuation">)</span> <span class="token keyword">for</span> k <span class="token keyword">in</span> metric_names<span class="token punctuation">}</span> <span class="token comment"># Save evaluation metrics with DVCLive</span> <span class="token keyword">with</span> Live<span class="token punctuation">(</span><span class="token builtin">dir</span><span class="token operator">=</span><span class="token builtin">str</span><span class="token punctuation">(</span>REPORTS_DIR<span class="token punctuation">)</span><span class="token punctuation">,</span> dvcyaml<span class="token operator">=</span><span class="token string-interpolation"><span class="token 
string">f"</span><span class="token interpolation"><span class="token punctuation">{</span>pdir<span class="token punctuation">}</span></span><span class="token string">/dvc.yaml"</span></span><span class="token punctuation">,</span><span class="token punctuation">)</span> <span class="token keyword">as</span> live<span class="token punctuation">:</span> <span class="token punctuation">[</span>live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span>k<span class="token punctuation">,</span> v<span class="token punctuation">,</span> plot<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token keyword">for</span> k<span class="token punctuation">,</span>v <span class="token keyword">in</span> selected_metrics<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">]</span></code></pre></div> <p>This code snippet shows how to use DVCLive to log model performance metrics calculated with Evidently. Here's a breakdown of what it does:</p> <ol> <li><code>model_performance_report</code> is created using the Regression Preset from Evidently.</li> <li>The <code>model_performance_report</code> is executed with <code>.run(...)</code>, where the actual model evaluation and metric computation occur.</li> <li>Once the report has been built, you can extract the required metrics. In this example, <code>selected_metrics</code> contains the values of <code>['r2_score', 'rmse', 'mean_error', 'mean_abs_error', 'mean_abs_perc_error']</code>.</li> <li>The <code>Live</code> context manager logs each entry of <code>selected_metrics</code> with the <code>live.log_metric()</code> method. 
There are a few important arguments: <ol> <li><code>dir=str(REPORTS_DIR)</code> instructs DVCLive to save metrics to the <code>reports/train</code> directory.</li> <li><code>dvcyaml=f"{pdir}/dvc.yaml"</code> instructs DVCLive to record information about the metrics files in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> of the <code>train</code> stage. The full path is <code>pipelines/train/dvc.yaml</code>.</li> </ol> </li> </ol> <blockquote> <p>💡 Note: If you are interested in other scenarios of DVCLive with Evidently integration, check <a href="https://dvc.org/doc/user-guide/integrations/evidently" target="_blank" rel="nofollow noopener noreferrer">this integration example</a>.</p> </blockquote> <h3 id="versioning-the-reference-dataset-and-monitoring-reports" style="position:relative;">Versioning the Reference Dataset and Monitoring Reports<a href="#versioning-the-reference-dataset-and-monitoring-reports" aria-label="versioning the reference dataset and monitoring reports permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This example shows how DVC makes it easy to manage reference datasets for monitoring and to version the monitoring reports themselves.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 600px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/60b6965e5137e8d26a2495ad004864e5/39600/12-versioning.png" alt="Versioning reference datasets with DVC" title="Versioning reference datasets with 
DVC" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Versioning reference datasets with DVC</em></p> <p>There are a few benefits for versioning reference datasets and monitoring reports with DVC:</p> <ul> <li><strong>Registry of Reference Datasets:</strong> DVC helps store, version, and download datasets for monitoring purposes. You may need to download the reference dataset saved to cloud storage for a monitoring job in the production environment. DVC makes life easier!</li> <li><strong>Traceability</strong>: This practice ensures traceability, allowing you to link model performance back to specific data versions.</li> <li><strong>Version Control of Reports</strong>: You may want to manage all monitoring reports with DVC. It ensures a historical record of your model's performance and data quality.</li> </ul> <h2 id="-summing-up" style="position:relative;">🎨 Summing up<a href="#-summing-up" aria-label=" summing up permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The combination of DVC and Evidently in automating data and monitoring pipelines offers a structured and efficient approach to ML model management. This setup enhances the reproducibility and reliability of your ML workflows and provides a clear framework for monitoring and improving your models over time. With this setup, you're well-equipped to maintain high-quality ML models responsive to the dynamic nature of real-world data.</p> <p>However, this tutorial covers only a single approach for DVC and Evidently integration. 
We are still working on other interesting scenarios and looking for community support. Stay tuned!</p> <blockquote> <p>💡 Did you find this tutorial interesting? Please leave your comments and share your experience with DVC and Evidently! Join us on <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> 🙌</p> </blockquote> <h2 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a href="https://www.evidentlyai.com/blog/tutorial-1-model-analytics-in-production" target="_blank" rel="nofollow noopener noreferrer">How to break a model in 20 days. A tutorial on production model analytics</a></li> <li><a href="https://iterative.ai/blog/turn-vs-code-into-ml-platform" target="_blank" rel="nofollow noopener noreferrer">Turn Your Favorite IDE into a Full Machine Learning Experimentation Platform</a></li> </ul> https://dvc.org/blog/dvc-git-lfs Wed, 03 Jan 2024 00:00:00 GMT <p>One of the main features provided by DVC is the ability to <a href="https://dvc.org/doc/command-reference/import#example-importing-from-any-git-repository" title="dvc import" target="_blank" rel="nofollow noopener noreferrer">import</a> and <a href="https://dvc.org/doc/command-reference/get#examples-get-a-misc-git-tracked-file" title="dvc get" target="_blank" rel="nofollow noopener noreferrer">download</a> files from any Git repository. 
In prior releases, this came with the caveat that projects which use <a href="https://git-lfs.com/" target="_blank" rel="nofollow noopener noreferrer">Git LFS</a> were unsupported. As of version 3.31.0, DVC supports reading Git LFS objects, so you can now <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> upstream datasets from platforms like <a href="https://huggingface.co/" target="_blank" rel="nofollow noopener noreferrer">Hugging Face</a> which use Git LFS, without needing to install any additional dependencies! Read on for an overview of how the DVC Git LFS client was implemented.</p> <p><em>To get started using DVC with Hugging Face, please refer to the DVC integrations <a href="https://dvc.org/doc/user-guide/integrations/huggingface" title="DVC/Hugging Face Integration" target="_blank" rel="nofollow noopener noreferrer">documentation</a>.</em></p> <p>DVC builds on top of Git's versioning capabilities using the open source libraries <a href="https://www.dulwich.io/" target="_blank" rel="nofollow noopener noreferrer">Dulwich</a> and <a href="https://www.pygit2.org/" target="_blank" rel="nofollow noopener noreferrer">pygit2</a> (which provides Python bindings for the C library <a href="https://github.com/libgit2/libgit2" target="_blank" rel="nofollow noopener noreferrer">libgit2</a>). Using these libraries provides DVC with access to Git functionality without requiring a traditional command line Git installation, which can be particularly useful in containerized environments. When integrating Git LFS support into DVC, we wanted to keep the same approach, so that DVC users could simply install DVC and then import and download files from any Git repository, regardless of whether or not that repository uses Git LFS. Neither Dulwich nor libgit2/pygit2 supports Git LFS natively, but libgit2 does provide an API for the low-level Git filter functionality used by Git LFS. 
We have <a href="https://github.com/libgit2/pygit2/pull/1237" title="pygit2 filters pull request" target="_blank" rel="nofollow noopener noreferrer">contributed</a> to pygit2 so that pygit2 users (like DVC) can now write libgit2 filters purely in Python, without needing to use the lower-level libgit2 C API.</p> <p><em>DVC's Git client library (which wraps Dulwich and pygit2) is available <a href="https://github.com/iterative/scmrepo" target="_blank" rel="nofollow noopener noreferrer">here</a>.</em></p> <h2 id="git-filters-overview" style="position:relative;">Git filters overview<a href="#git-filters-overview" aria-label="git filters overview permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Git supports attribute <a href="https://git-scm.com/docs/gitattributes#_filter" title="Git attributes filters" target="_blank" rel="nofollow noopener noreferrer">filters</a> that control how objects are stored internally in Git compared to how they are stored in your workspace. One commonly used built-in filter is the CRLF filter, which adjusts line endings in text files. The CRLF filter is typically used to ensure that files are checked out into the workspace using the appropriate line endings for the user's platform (linefeed on Unix and carriage return + linefeed on Windows), but are always stored in Git with Unix-style line endings.</p> <p>Git LFS also works by using Git filters. When you add a file with the <code>filter=lfs</code> attribute to Git, the Git LFS filter generates a "pointer" for Git to store internally. 
The LFS pointer is a small text file containing the SHA-256 LFS object ID and the size of the original file. The Git LFS filter places the original file in Git LFS storage, and then outputs the pointer to Git (instead of the original file). Upon checkout, Git passes the pointer to the Git LFS filter, which then reads the LFS object ID and checks out the appropriate original file into your workspace.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version https://git-lfs.github.com/spec/v1
oid sha256:b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
size 4</code></pre></div> <p><em>Example Git LFS pointer</em></p> <h2 id="libgit2-and-pygit2-filters" style="position:relative;">libgit2 and pygit2 filters<a href="#libgit2-and-pygit2-filters" aria-label="libgit2 and pygit2 filters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When saving objects in Git and when checking them back out to the workspace, libgit2 runs a chain of registered filters. Each filter in the chain modifies the object data as needed, and then passes the modified result into the next filter. Writing a libgit2 filter in C is fairly complex: it requires implementing multiple levels of callback structs to handle the underlying buffered write streams, in addition to the filter itself. Our newly contributed support for Python filters in pygit2 simplifies this. 
The low-level libgit2 APIs are abstracted away, and a <code>pygit2.Filter</code> subclass only needs to implement three basic methods: <code>check()</code>, <code>write()</code> and <code>close()</code>.</p> <ul> <li><code>Filter.check()</code> is called prior to processing any object with Git attributes which match the registered filter, and the filter can verify whether or not it should be used with the given object, or indicate that the filter does not need to be applied.</li> <li><code>Filter.write()</code> is called one or more times and is used to “write” input data chunks to the filter.</li> <li><code>Filter.close()</code> is called after all of the input data has been written to the filter.</li> </ul> <p>The filter can send output data chunks to the next filter in the chain as needed via the <code>write_next()</code> callback.</p> <p><em>Note: in Git, <code>smudge</code> filters are run when checking out objects from the Git object database into the workspace, and <code>clean</code> filters are run when saving objects from the workspace into the Git object database. 
In libgit2 and pygit2, a single filter is registered which is used in both cases, and the direction is indicated by the <code>mode</code> parameter.</em></p> <h2 id="the-scmrepo-git-lfs-filter" style="position:relative;">The scmrepo Git LFS filter<a href="#the-scmrepo-git-lfs-filter" aria-label="the scmrepo git lfs filter permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Thanks to this higher-level abstraction in pygit2, implementing the Git LFS <code>smudge</code> filter in Python is straightforward:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">check</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> src<span class="token punctuation">:</span> <span class="token string">"FilterSource"</span><span class="token punctuation">,</span> attr_values<span class="token punctuation">:</span> List<span class="token punctuation">[</span><span class="token builtin">str</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">if</span> attr_values<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">==</span> <span class="token string">"lfs"</span><span class="token punctuation">:</span>
        <span class="token keyword">if</span> src<span class="token punctuation">.</span>mode <span class="token operator">!=</span> GIT_FILTER_CLEAN<span class="token punctuation">:</span>
            self<span class="token punctuation">.</span>_smudge_buf <span class="token operator">=</span> io<span class="token punctuation">.</span>BytesIO<span class="token punctuation">(</span><span class="token punctuation">)</span>
            self<span class="token punctuation">.</span>_smudge_root <span class="token operator">=</span> src<span class="token punctuation">.</span>repo<span class="token punctuation">.</span>workdir <span class="token keyword">or</span> src<span class="token punctuation">.</span>repo<span class="token punctuation">.</span>path
            <span class="token keyword">return</span>
    <span class="token keyword">raise</span> Passthrough</code></pre></div> <p>In <code>check()</code>, the first element in <code>attr_values</code> will contain the object’s <code>filter</code> Git attribute. We verify that the object has <code>filter=lfs</code> set and that we are in <code>smudge</code> mode (our filter is currently read-only and does not need to implement <code>clean</code> mode). 
When in <code>smudge</code> mode, we initialize an internal buffer for reading the pointer data from Git and store the original Git repository root path (which will be needed later).</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">write</span><span class="token punctuation">(</span>
    self<span class="token punctuation">,</span> data<span class="token punctuation">:</span> <span class="token builtin">bytes</span><span class="token punctuation">,</span> src<span class="token punctuation">:</span> <span class="token string">"FilterSource"</span><span class="token punctuation">,</span> write_next<span class="token punctuation">:</span> Callable<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token builtin">bytes</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token boolean">None</span><span class="token punctuation">]</span>
<span class="token punctuation">)</span><span class="token punctuation">:</span>
    …
    self<span class="token punctuation">.</span>_smudge_buf<span class="token punctuation">.</span>write<span class="token punctuation">(</span>data<span class="token punctuation">)</span></code></pre></div> <p>In <code>write()</code> we append the input chunk to our buffer and then return. 
We do not write to the next filter, since Git LFS <code>smudge</code> depends on reading the entire pointer input before we can output any data.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">close</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> write_next<span class="token punctuation">:</span> Callable<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token builtin">bytes</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    …
    self<span class="token punctuation">.</span>_smudge<span class="token punctuation">(</span>write_next<span class="token punctuation">)</span>

<span class="token keyword">def</span> <span class="token function">_smudge</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> write_next<span class="token punctuation">:</span> Callable<span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token builtin">bytes</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token boolean">None</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    …
    self<span class="token punctuation">.</span>_smudge_buf<span class="token punctuation">.</span>seek<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token keyword">with</span> Git<span class="token punctuation">(</span>self<span class="token punctuation">.</span>_smudge_root<span class="token punctuation">)</span> <span class="token keyword">as</span> scm<span class="token punctuation">:</span>
        <span class="token keyword">try</span><span class="token punctuation">:</span>
            url <span class="token operator">=</span> get_fetch_url<span class="token punctuation">(</span>scm<span class="token punctuation">)</span>
        <span class="token keyword">except</span> InvalidRemote<span class="token punctuation">:</span>
            url <span class="token operator">=</span> <span class="token boolean">None</span>
        fobj <span class="token operator">=</span> smudge<span class="token punctuation">(</span>scm<span class="token punctuation">.</span>lfs_storage<span class="token punctuation">,</span> self<span class="token punctuation">.</span>_smudge_buf<span class="token punctuation">,</span> url<span class="token operator">=</span>url<span class="token punctuation">)</span>
        data <span class="token operator">=</span> fobj<span class="token punctuation">.</span>read<span class="token punctuation">(</span>io<span class="token punctuation">.</span>DEFAULT_BUFFER_SIZE<span class="token punctuation">)</span>
        <span class="token keyword">try</span><span class="token punctuation">:</span>
            <span class="token keyword">while</span> data<span class="token punctuation">:</span>
                write_next<span class="token punctuation">(</span>data<span class="token punctuation">)</span>
                data <span class="token operator">=</span> fobj<span class="token punctuation">.</span>read<span class="token punctuation">(</span>io<span class="token punctuation">.</span>DEFAULT_BUFFER_SIZE<span class="token punctuation">)</span>
        <span class="token keyword">except</span> KeyboardInterrupt<span class="token punctuation">:</span>
            <span class="token keyword">return</span></code></pre></div> <p>In <code>close()</code>, we get the configured Git LFS remote URL (if it is set) and then run our actual <code>smudge()</code> implementation. scmrepo’s <code>smudge()</code> method will return a Python file-like object stream for the original file (and not the internal pointer). 
We then just need to do a series of chunked reads and writes to send the original file data to the next filter in the chain.</p> <p>Since Git LFS <code>smudge</code> behavior is well defined by the <a href="https://github.com/git-lfs/git-lfs/blob/main/docs/spec.md#intercepting-git" title="Git LFS specification" target="_blank" rel="nofollow noopener noreferrer">Git LFS specification</a> we will not go into a detailed explanation of our Python implementation here. In short, <code>smudge()</code> verifies that the input data is a valid Git LFS pointer, reads the Git LFS object ID from the pointer, and then loads the appropriate object from Git LFS storage. If the specified object ID is not available in the local Git LFS storage, it will be fetched from the remote Git LFS server.</p> <p><em>The complete source code for our scmrepo Git LFS filter is available on Github: <a href="https://github.com/iterative/scmrepo/blob/main/src/scmrepo/git/backend/pygit2/filter.py" title="scmrepo filter.py" target="_blank" rel="nofollow noopener noreferrer">filter.py</a>, <a href="https://github.com/iterative/scmrepo/blob/main/src/scmrepo/git/lfs/smudge.py" title="scmrepo smudge.py" target="_blank" rel="nofollow noopener noreferrer">smudge.py</a></em></p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This recent update to DVC marks a significant milestone by eliminating the prior limitation associated with Git LFS incompatibility. 
With version 3.31.0, DVC users can seamlessly import files from Git repositories that use Git LFS, including those hosted on platforms like Hugging Face, without needing extra dependencies. The integration of Git LFS support, facilitated by the Dulwich and pygit2 libraries, streamlines managing datasets and large objects in a Git repository.</p> <p>This reinforces DVC's commitment to providing a versatile and user-friendly open-source version control solution for diverse Git repositories.</p> https://dvc.org/blog/turn-vs-code-into-ml-platform Thu, 16 Nov 2023 00:00:00 GMT <p><strong>Need an easy way to run and track your experiments?</strong> Install the DVC extension from the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">VS Code marketplace</a>. Then, run experiments, visualize deep learning metrics in real-time, compare experiments, and save the ones you like - all from your IDE.</p> <p><img src="https://dvc.org/2023-11-16/run-python-file-f7dd9309e0f6abf350eac3c8a083cef2.gif" alt="Run Python file"><em>Run a Python file and see results</em></p> <p><strong>Want to simplify your chaotic ML iterations?</strong> With the DVC extension, you can run reproducible workflows directly from VS Code.</p> <p><img src="https://dvc.org/2023-11-16/modify-and-run-00de7b58ccfe3155924ed5da316ce9b8.gif" alt="Run a new experiment"><em>Run a new experiment directly from VS Code</em></p> <p>Live plots let you visualize metrics from these runs in real-time.</p> <p><img src="https://dvc.org/2023-11-16/live-plots-cacfadaa43860e33db3ac9286fde2881.gif" alt="View plots in real-time"><em>View plots in real-time</em></p> <p>To make it easy for you to create the workflows, the extension even auto-generates code snippets.</p> <p><img src="https://dvc.org/2023-11-16/auto-generate-code-fffffd810d133db3cc2bd2b08aa2cb99.gif" alt="Auto-generate pipeline specifications"><em>Auto-generate pipeline 
specifications</em></p> <p><strong>Tired of context switching throughout the day?</strong> The integration of DVC with VS Code empowers you to do everything from within your IDE. No more jumping from notebooks to the terminal to IDE to web browsers to Git.</p> <h1 id="why-a-dvc-extension-for-vs-code" style="position:relative;">Why a DVC extension for VS Code?<a href="#why-a-dvc-extension-for-vs-code" aria-label="why a dvc extension for vs code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> has helped individual ML developers and teams in companies like UBS, DeGould, Exscientia, Kibsi and many more to standardize their ML workflows on top of their cloud resources and Git repositories.</p> <p>Visual Studio Code (VS Code) is, by far, the most popular IDE for all developers, including ML engineers.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ffd03af5b25e2fc25799a4bf5a38f6e7/39600/so-survey.png" alt="StackOverflow survey" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Source: <a href="https://survey.stackoverflow.co/2022/#section-most-popular-technologies-integrated-development-environment" target="_blank" rel="nofollow noopener noreferrer">StackOverflow survey 2022</a></em></p> <p>The DVC extension makes VS Code even more useful for you by providing you a VS 
Code-native environment for managing your ML projects. You get the power of DVC with capabilities beyond what's available in the terminal!</p> <p>With over 34 thousand installs, the extension is proven to help you solve the challenges of creating and managing your Machine Learning workflows.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d7618fc8b5efaf80fc98f5fa6a4767ad/39600/dvc-extension-in-vs-code-marketplace.png" alt="DVC extension in the VS Code marketplace" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>DVC extension in the VS Code marketplace</em></p> <h1 id="getting-started-with-the-dvc-extension-for-vs-code" style="position:relative;">Getting started with the DVC extension for VS Code<a href="#getting-started-with-the-dvc-extension-for-vs-code" aria-label="getting started with the dvc extension for vs code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>To install the extension, open VS Code and search for "DVC" in the Extensions view. Or install the extension from the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">VS Code marketplace</a>.</p> <p>Now, create a DVC repository for your machine learning project and start experimenting! 
Here’s how to do this:</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/6KtIRVfr61E?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>To see all the DVC commands supported by the extension, open the DVC Command Palette using F1 (or ⇧⌃P on Windows/Linux, ⇧⌘P on macOS) and type DVC.</p> <h1 id="its-always-getting-better" style="position:relative;">It’s always getting better!<a href="#its-always-getting-better" aria-label="its always getting better permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Over the last year, we’ve made several enhancements to the DVC extension for VS Code. For some of the interesting stuff you can do with it, watch the videos <a href="https://www.youtube.com/watch?v=VMYggTLm_-U&list=PL7WG7YrwYcnBo3ZBapzKNxtBcfNjGDQMM&index=5" target="_blank" rel="nofollow noopener noreferrer">here</a>. 
As a mark of the extension reaching a new level of maturity, today we have launched it on <a href="https://www.producthunt.com/posts/dvc-extension-for-vs-code" target="_blank" rel="nofollow noopener noreferrer">Product Hunt</a>. It would be awesome if you checked it out and left us some feedback and support!</p> <p>We are excited to see how the DVC VS Code extension helps you simplify your ML workflows. For more information:</p> <ul> <li>DVC extension in the VS Code marketplace: <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">https://marketplace.visualstudio.com/items?itemName=Iterative.dvc</a></li> <li>GitHub repository: <a href="https://github.com/iterative/vscode-dvc" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/vscode-dvc</a></li> <li>DVC documentation: <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/</a></li> <li>DVC community forum: <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/chat</a></li> </ul>https://dvc.org/blog/leveraging-llms-in-chatbots-the-dvc-approachhttps://dvc.org/blog/leveraging-llms-in-chatbots-the-dvc-approachMon, 25 Sep 2023 00:00:00 GMT<p>In the modern world of Machine Learning (ML) and Natural Language Processing (NLP), there's been a surge in applications built on top of Large Language Models (LLMs). Adoption has been almost exponential, with companies building LLM-based applications across a variety of areas.</p> <p>In this post we will show how DVC can make designing LLM applications more efficient and organized. 
We take a Retrieval-Augmented Generation (<a href="https://artificialcorner.com/retrieval-augmented-generation-rag-a-short-introduction-21d0044d65ff" target="_blank" rel="nofollow noopener noreferrer">RAG</a>) approach and illustrate how we can break down the various phases of a RAG chatbot and version them with DVC. We can use DVC to both "time travel" and avoid the need to re-compute stages unnecessarily with little extra effort.</p> <h2 id="the-rise-of-chatbots-in-technical-advice" style="position:relative;">The Rise of Chatbots in Technical Advice<a href="#the-rise-of-chatbots-in-technical-advice" aria-label="the rise of chatbots in technical advice permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Chatbots are finding a natural fit in providing technical advice. For our product, DVC, which has amassed significant popularity, we've introduced a chatbot designed to streamline user experience. Our bot sources information not just from our official documentation but also from our community discussions on Discord. 
This creates a broader knowledge base than using our official documentation alone, and ensures a balanced mix of official guidelines and community insights.</p> <h2 id="the-rag-approach" style="position:relative;">The RAG Approach<a href="#the-rag-approach" aria-label="the rag approach permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our chatbot uses the Retrieval-Augmented Generation (RAG) approach. The <a href="https://towardsdatascience.com/rag-vs-finetuning-which-is-the-best-tool-to-boost-your-llm-application-94654b1eaba7" target="_blank" rel="nofollow noopener noreferrer">debate</a> between the efficacy of RAG vs. fine-tuning methods is ongoing and lively. However, our choice leans towards RAG due to its simplicity and relative computation efficiency for quickly iterating on different approaches.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bf4e94f792cc00d8e6f80269a1a3e8bd/39600/flowchart.png" alt="RAG flowchart" title="RAG flowchart" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Illustration of the RAG approach: First we build a vector database with chunks of text. 
After we retrieve chunks relevant to the user query from the vector database, we insert those chunks into the prompt to give the LLM context.</em></p> <h2 id="citation-a-key-differentiator" style="position:relative;">Citation: A Key Differentiator<a href="#citation-a-key-differentiator" aria-label="citation a key differentiator permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A common complaint about chatbots is that they do not cite any sources, which leaves users with few avenues to validate the information provided by the chatbot.</p> <p><img src="https://dvc.org/2023-09-25/chat_bot_gif-c5be4d288070c6cc9d0a913486e15dc1.gif" alt="Chatbot in action video" title="=800"><em>Demo of our chatbot</em></p> <p>Our chatbot is able to cite the sources of its answers. It does this using the LangChain <a href="https://api.python.langchain.com/en/latest/chains/langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain.html" target="_blank" rel="nofollow noopener noreferrer">RetrievalQAWithSourcesChain</a>. 
This is a key feature for many users.</p> <h2 id="building-the-chatbot-using-dvc" style="position:relative;">Building the Chatbot Using DVC<a href="#building-the-chatbot-using-dvc" aria-label="building the chatbot using dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our chatbot builds on top of the <a href="https://github.com/hwchase17/notion-qa" target="_blank" rel="nofollow noopener noreferrer">LangChain Notion Question-Answering</a> example using DVC to manage the pipeline. Interestingly, while we built a chatbot for DVC, we also employed DVC in its construction. This seemingly circular approach allowed us to leverage the standard benefits that DVC offers:</p> <ol> <li><strong>Rollback Facility</strong>: The ability to revert to previous versions is invaluable, especially when dealing with unpredictable outputs in response to varying prompts.</li> <li><strong>Efficiency</strong>: DVC prevents redundant computation when updating specific phases, saving both time and computational resources.</li> <li><strong>Visual Representation with DVC DAG</strong>: The Directed Acyclic Graph (DAG) provided by DVC visualizes how the chatbot's construction is broken down into distinct stages, aiding understanding and development.</li> </ol> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">+----------------------+
| discord_dump.zip.dvc |
+----------------------+
+-------------------+
| docs_dump.zip.dvc |
+-------------------+
          *
          *
          *
     +--------+
     | expand |
     +--------+
          *
          *
          *
     +--------+
     | ingest |
     +--------+
          *
          *
          *
    +-----------+        +-----------------+
    | vectorize |        | samples.txt.dvc |
    +-----------+        +-----------------+
            ***               ***
               *             *
                **         **
                   +-----+
                   | run |
                   +-----+</code></pre></div> <p>The bot is built in a few standard phases for RAG:</p> <ol> <li><code>expand</code>: unzip archives of documents</li> <li><code>ingest</code>: This is how we chunk the text of the documents into small pieces that we can embed and also put into prompts for the chatbot. The standard <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter" target="_blank" rel="nofollow noopener noreferrer">text splitters</a> make sense for documentation pages, but a dump of two years' worth of Discord chats requires a custom splitter.</li> <li><code>vectorize</code>: Build a <a href="https://github.com/facebookresearch/faiss" target="_blank" rel="nofollow noopener noreferrer">vector database</a> with embeddings of all the text chunks</li> <li><code>run</code>: Extract the relevant text chunks for the sample questions, put them into prompts, and call the LLM</li> </ol> <p>DVC allows us to keep the outputs from each stage under version control, and manage the parameterization, with little extra effort. This provides the advantage that if we choose to update the vectorize stage, we can reuse the outputs of the ingest stage without re-running it. Or, if we want to roll back to an old version of vectorize, we can get that intermediate output back without re-running it, and without the high risk of versioning mistakes that comes with doing this manually.</p> <p>Both the vectorize and run stages use the OpenAI API. 
So, repeated computation not only costs time but also actual dollars.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 616px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b381d3c919972544273b5ac34e7be75a/0e253/docs_text_chunking.png" alt="Text chunking the official docs" title="Text chunking the official docs" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>We apply a standard text chunker to the markdown for our official documentation. It contains a few options for chunk size and desired overlap between chunks. DVC helps organize these parameters.</em></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 504px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4e25c4bf35758804f5933504eadd3a9d/0dcb2/discord_text_chunking.png" alt="Text chunking the public discord" title="Text chunking the public discord" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>For our Discord, we group together successive messages from the same author and then start a chunk at each message. The author and datetime can be formatted in various ways when they are put into the prompts in later stages. 
Experimenting with these options is easier when you have DVC.</em></p> <h2 id="the-importance-of-rollback" style="position:relative;">The Importance of Rollback<a href="#the-importance-of-rollback" aria-label="the importance of rollback permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Changes in chatbot prompts can have unforeseen consequences. In some cases, they might improve the bot's performance, while in others, they might lead to degradation. Given the computational cost of re-running phases and the unpredictable nature of such changes, rollback doesn't merely refer to reverting to old code. 
It also allows reverting to older intermediate outputs, making the development process much more computationally efficient and organized.</p> <h2 id="incorporating-the-discord-community-insights" style="position:relative;">Incorporating the Discord Community Insights<a href="#incorporating-the-discord-community-insights" aria-label="incorporating the discord community insights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>One significant factor affecting the performance of our chatbot is the manner in which we segment and integrate text from our Discord channel. Different text-splitting techniques can lead to variance in performance, highlighting the importance of continually refining this integration process. Furthermore, providing useful meta information for sources in Discord can be done in various ways. 
Again, DVC handles the bookkeeping of iterating on these approaches without re-running unchanged stages.</p> <h2 id="running-it-yourself" style="position:relative;">Running it Yourself<a href="#running-it-yourself" aria-label="running it yourself permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The code lives in the Git repository <a href="https://github.com/iterative/llm-demo" target="_blank" rel="nofollow noopener noreferrer">here</a>. Once you have an <a href="https://platform.openai.com/account/api-keys" target="_blank" rel="nofollow noopener noreferrer">OpenAI API key</a>, you can easily get the project going with <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>. Re-running the demo from scratch costs about $0.40 USD in credits.</p> <p>First, clone the repository:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> git@github.com:iterative/llm-demo.git
</span><span class="token line"><span class="token input">$ </span><span class="token command">cd</span> llm-demo</span></code></pre></div> <p>The training run is fully logged with DVC in an S3 store. 
So, if you are already authenticated on AWS you can get all the data with:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span></span></code></pre></div> <p>To set your environment up to run the code, first install all requirements in a virtual environment:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">virtualenv</span> <span class="token function">env</span> <span class="token parameter variable">--python</span><span class="token operator">=</span>python3
</span><span class="token line"><span class="token input">$ </span><span class="token command">source</span> env/bin/activate
</span><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements.txt</span></code></pre></div> <p>Then set your OpenAI API key (if you don't have one, get one <a href="https://beta.openai.com/playground" target="_blank" rel="nofollow noopener noreferrer">here</a>):</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span class="token assign-left variable">OPENAI_API_KEY</span><span class="token operator">=</span><span class="token punctuation">..</span>.</span></code></pre></div> <p>Typing a space before the command prevents the API key from staying in your bash history if that is <a href="https://stackoverflow.com/questions/6475524/how-do-i-prevent-commands-from-showing-up-in-bash-history" target="_blank" rel="nofollow noopener noreferrer">configured</a>.</p> <p>Now you should be ready to re-run the training 
pipeline. Assuming you have not changed anything, nothing should need to run. Everything can be reused from the DVC pull:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div> <p>Now you can start up the web UI using:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">streamlit</span> run main.py</span></code></pre></div> <p>The command should open the bot in your web browser. The log of interactions can be found in <code>chat.log</code>.</p> <h2 id="example-of-using-dvc-rollback" style="position:relative;">Example of using DVC rollback<a href="#example-of-using-dvc-rollback" aria-label="example of using dvc rollback permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Let's take a concrete example of how we can use DVC in the bot development. Suppose we want to adjust the embedding parameter <code>embedding_ctx_length</code> because we think it can help us save some cost on API calls and lower the interactive latency. 
To do this in a reproducible way, we first make a git branch to do the change:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> <span class="token parameter variable">-b</span> try_new_embed</span></code></pre></div> <p>Now if we re-run the pipeline with DVC we will notice that it skips re-running the expand and ingest phases because nothing has changed for their dependencies:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'OpenAIEmbeddings.embedding_ctx_length=256'</span>
</span>'samples.txt.dvc' didn't change, skipping
Stage 'setup' didn't change, skipping
'docs_dump.zip.dvc' didn't change, skipping
Stage 'expand' didn't change, skipping
Stage 'ingest' didn't change, skipping
Running stage 'vectorize':
<span class="token line"><span class="token input">$ </span><span class="token command">python</span> vector_store.py
</span>...</code></pre></div> <p>We can also version the outputs with DVC:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> dvc.lock params.yaml
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">"new embed model"</span></span></code></pre></div> <p>We can try out the new settings with:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token 
command">streamlit</span> run main.py</span></code></pre></div> <p>However, if despite any cost savings we don't like the results with these new settings, we can easily revert to the old pipeline using Git and DVC:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> master
</span>Switched to branch 'master'
Your branch is up to date with 'origin/master'.
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span>
</span>M faiss_store.pkl
M docs.index
M results.csv
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>'samples.txt.dvc' didn't change, skipping
Stage 'setup' didn't change, skipping
'docs_dump.zip.dvc' didn't change, skipping
Stage 'expand' didn't change, skipping
Stage 'ingest' didn't change, skipping
Stage 'vectorize' didn't change, skipping
Stage 'run' didn't change, skipping
Data and pipelines are up to date.</code></pre></div> <p>DVC does not need to rerun any stage because it has saved all the old outputs from the master branch. 
Likewise, we can always switch back to the experimental setup with:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> try_new_embed
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span></span></code></pre></div> <p>Using these few commands, we can use DVC to both "time travel" and avoid the need to re-compute stages unnecessarily with little extra effort.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The benefits of using DVC are shared across most LLM applications. Whether you are working with a Discord, Slack, or Google Docs corpus, and whether you use RAG or fine-tuning, using DVC to manage your pipeline will bring similar benefits. Using DVC not only streamlines the development process but also makes experiments reproducible. 
Given the similarities that most LLM applications share, it's safe to conclude that they could benefit immensely from incorporating DVC in their workflows.</p>https://dvc.org/blog/finetune-llm-pipeline-dvc-skypilothttps://dvc.org/blog/finetune-llm-pipeline-dvc-skypilotFri, 08 Sep 2023 00:00:00 GMT<h2 id="introduction---solving-cloud-resources-and-reproducibility-for-llms" style="position:relative;">Introduction - Solving cloud resources and reproducibility for LLMs<a href="#introduction---solving-cloud-resources-and-reproducibility-for-llms" aria-label="introduction solving cloud resources and reproducibility for llms permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A few weeks ago, I wrote a <a href="https://alex000kim.com/tech/2023-08-10-ml-experiments-in-cloud-skypilot-dvc/" target="_blank" rel="nofollow noopener noreferrer">post</a> about the challenges of training large ML models, in particular:</p> <ol> <li>the need for more computing power and the complexity of managing cloud resources;</li> <li>the difficulty of keeping track of ML experiments and reproducing results.</li> </ol> <p>There I proposed a solution to these problems by using <a href="https://skypilot.readthedocs.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer">SkyPilot</a> and <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> to manage cloud resources and track experiments, respectively.</p> <p>These problems are especially relevant for large language models, where both the model size and the amount of data required for training 
are <em>very</em> large. In this blog post, I will walk you through an end-to-end production-grade Machine Learning pipeline for performing Supervised Fine-Tuning (SFT) of large language models (LLMs) on conversational data. This project demonstrates the effective use of technologies like <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, <a href="https://github.com/skypilot-org/skypilot" target="_blank" rel="nofollow noopener noreferrer">SkyPilot</a>, HuggingFace <a href="https://github.com/huggingface/transformers" target="_blank" rel="nofollow noopener noreferrer">Transformers</a>, <a href="https://github.com/huggingface/peft" target="_blank" rel="nofollow noopener noreferrer">PEFT</a>, <a href="https://github.com/huggingface/trl" target="_blank" rel="nofollow noopener noreferrer">TRL</a> and others.</p> <p>All the code for this project is available on GitHub:</p> <p><a href="https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace" target="_blank" rel="nofollow noopener noreferrer">https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace</a></p> <h3 id="whats-fine-tuning-and-when-to-use-it" style="position:relative;">What’s fine-Tuning and when to use it<a href="#whats-fine-tuning-and-when-to-use-it" aria-label="whats fine tuning and when to use it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Let’s recap the differences between prompt engineering, prompt tuning, and model fine-tuning, three distinct approaches to working with LLMs.</p> <p>Feel free to skip this 
section if you’re already familiar with these concepts.</p> <details> <summary> Prompt engineering, prompt tuning, and model fine-tuning </summary> <p>Prompt engineering, prompt tuning, and model fine-tuning are three techniques for adapting large language models to downstream tasks. Prompt engineering relies on skillfully designing input prompts, often with demo examples, to steer model behavior without any parameter changes. Prompt tuning takes a more automated approach - learning continuous token embeddings as tunable prompts appended to the input. This keeps the base model frozen but allows the prompts to be optimized. Finally, model fine-tuning adapts all the model’s parameters directly through continued training on downstream data. While fine-tuning can achieve strong performance, prompt engineering and tuning offer greater parameter efficiency and model reuse. However, prompt methods may require more iteration and heuristics to work well.</p> <p>Fine-tuning gives the model maximal flexibility to adapt its entire set (or a subset) of parameters directly on the new data. This end-to-end training approach is especially powerful when the target task or domain differs significantly from the original pre-training data. In such cases, extensive adaptation of the model may be required beyond what is possible through the model’s fixed input representations alone. However, fine-tuning requires re-training large models which can be computationally expensive. It also loses the ability to efficiently share one model across multiple tasks. 
Overall, fine-tuning tends to be preferred when maximum task performance is critical and training resources are available.</p> <p>Below is a table comparing these techniques:</p> <table><thead><tr><th>Method</th><th>Description</th><th>Advantages</th><th>Disadvantages</th></tr></thead><tbody><tr><td>Prompt Engineering</td><td>Skillfully designing input prompts, often with demo examples, to steer model behavior without parameter changes</td><td>• Efficient parameter reuse <br> • No model re-training needed</td><td>• Can require much iteration and tuning <br> • Limited flexibility to adapt model</td></tr><tr><td>Prompt Tuning</td><td>Learning continuous token embeddings as tunable prompts appended to input, keeps base model frozen</td><td>• Efficient parameter reuse <br> • Automated prompt optimization</td><td>• Less flexible than fine-tuning <br> • Still some manual effort needed</td></tr><tr><td>Model Fine-tuning</td><td>Adapting all (or a subset) of the model’s parameters through continued training on new data</td><td>• Allows significant adaptation to new tasks/data <br> • Can achieve very strong performance</td><td>• Can be difficult to set up <br> • Computationally expensive <br> • Loses ability to share model across tasks</td></tr></tbody></table> </details> <h2 id="overview-of-the-project" style="position:relative;">Overview of the Project<a href="#overview-of-the-project" aria-label="overview of the project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The project leverages several technologies:</p> <ol> <li><strong><a href="https://dvc.org/" 
target="_blank" rel="nofollow noopener noreferrer">DVC</a></strong> for reproducible ML pipelines: This tool enables us to define the ML workflow as a Directed Acyclic Graph (DAG) of pipeline stages, with dependencies between data, models, and metrics automatically tracked. It also integrates with remote storage like S3 to efficiently version large datasets and model files.</li> <li><strong><a href="https://skypilot.readthedocs.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer">SkyPilot</a></strong> for scalable cloud infrastructure: SkyPilot simplifies the process of launching cloud compute resources on demand for development or distributed training. It supports spot instances to reduce training costs and enables quick setup of remote interactive development environments.</li> <li><strong><a href="https://huggingface.co/" target="_blank" rel="nofollow noopener noreferrer">HuggingFace</a></strong> and other libraries for efficient training of quantized models: HuggingFace Transformers provides a simple API for training and fine-tuning large transformer models. In combination with bitsandbytes, it enables reduced-precision and quantization-aware training for greater efficiency.</li> </ol> <p>The <a href="https://github.com/artidoro/qlora" target="_blank" rel="nofollow noopener noreferrer">QLoRA</a> quantization technique will allow us to apply 4-bit quantization to the model weights. For the Llama 7B model, this reduces the estimated GPU memory needed for training from ~98 GB (with float32 precision) down to ~12 GB (with int4 precision). 
The screenshot below is from a handy <a href="https://huggingface.co/spaces/hf-accelerate/model-memory-usage" target="_blank" rel="nofollow noopener noreferrer">Model Memory Calculator</a> that helps you estimate how much vRAM is needed to train a model hosted on the Hugging Face Hub.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c6afbd26db7f41a6f93eae5d877b45b1/39600/gpu_memory_requirements.png" alt="GPU memory requirements" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Considering the GPU memory overhead due to optimizer states, gradients, and forward activations, we’d need around 16 GB of vRAM to fine-tune a 4-bit quantized 7B model. The NVIDIA A10G is a good candidate for this (<a href="https://aws.amazon.com/ec2/instance-types/g5/" target="_blank" rel="nofollow noopener noreferrer"><code>g5.2xlarge</code></a> instance on AWS) as it costs a little over $1 per hour for on-demand pricing or $0.35 per hour for spot instance pricing.</p> <p>The total training time will depend on the size of your dataset and the number of epochs you want to train for. But with this setup, I believe it's possible to train a model to achieve decent (better than the base pretrained model) performance on some narrow task for under $50 total.</p> <p>For comparison, if you were fine-tuning the same model but with float16 precision, you’d need one or more NVIDIA A100 (80GB version) or H100 GPUs. Currently, they are almost impossible to get access to due to the high demand (unless you work at one of the <a href="https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini" target="_blank" rel="nofollow noopener noreferrer">“GPU-rich” companies</a>). This kind of cloud hardware can be 5-10 times more expensive. 
For example, according to this <a href="https://blog.skypilot.co/finetuning-llama2-operational-guide/" target="_blank" rel="nofollow noopener noreferrer">post</a>, it would cost you a little over $300 to fine-tune a non-quantized 7b Llama 2 model on the <a href="https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered" target="_blank" rel="nofollow noopener noreferrer">ShareGPT</a> dataset for 3 epochs.</p> <p>The price, of course, isn’t the only important factor. There are other low-cost Jupyter-based development environments like Google Colab or Kaggle Notebooks. While a Jupyter environment is convenient for developing prototypes, the key advantage of the everything-as-code (EaC) approach proposed here is that it centralizes your code, datasets, hyperparameters, model weights, training infrastructure and development environment in a git repository. With LLMs being notoriously unpredictable, maintaining tight version control over training is critical.</p> <h3 id="setup" style="position:relative;">Setup<a href="#setup" aria-label="setup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To begin, clone the project repository. 
Then, install SkyPilot and DVC using pip:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ pip <span class="token function">install</span> skypilot<span class="token punctuation">[</span>all<span class="token punctuation">]</span> dvc<span class="token punctuation">[</span>all<span class="token punctuation">]</span></code></pre></div> <p>Next, configure your cloud provider credentials. You can refer to the <a href="https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloud-account-setup" target="_blank" rel="nofollow noopener noreferrer">SkyPilot documentation</a> for more details.</p> <p>Confirm the setup with the following command:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ sky check</code></pre></div> <p>After configuring the setup, you’ll need to download the data from the read-only remote storage in this project to your local machine, then upload it to your own bucket (where you have write access).</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Pull data from remote storage to local machine</span>
$ dvc pull
<span class="token comment"># Configure your own bucket in .dvc/config:</span>
<span class="token comment"># - AWS: https://iterative.ai/blog/aws-remotes-in-dvc</span>
<span class="token comment"># - GCP: https://iterative.ai/blog/using-gcp-remotes-in-dvc</span>
<span class="token comment"># - Azure: https://iterative.ai/blog/azure-remotes-in-dvc</span>
<span class="token comment"># Push the data to your own bucket</span>
$ dvc push</code></pre></div> <h2 id="huggingface-perform-resource-efficient-fine-tuning" style="position:relative;">HuggingFace: Perform Resource Efficient Fine-Tuning<a href="#huggingface-perform-resource-efficient-fine-tuning" aria-label="huggingface perform resource efficient fine tuning 
permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here we’ll walk through the training approach without going into too much detail. Please check the references at the end of this post for more information on the techniques used. We started by loading a pretrained Llama-2 model and tokenizer. To make training even more efficient, we used <code>bitsandbytes</code> to quantize the model to 4-bit precision, following the <a href="https://github.com/artidoro/qlora" target="_blank" rel="nofollow noopener noreferrer">QLoRA</a> approach, and <a href="https://huggingface.co/blog/peft" target="_blank" rel="nofollow noopener noreferrer">PEFT</a> to train only a small set of adapter weights.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> torch
<span class="token keyword">from</span> transformers <span class="token keyword">import</span> AutoModelForCausalLM<span class="token punctuation">,</span> AutoTokenizer<span class="token punctuation">,</span> BitsAndBytesConfig


<span class="token keyword">def</span> <span class="token function">get_model_and_tokenizer</span><span class="token punctuation">(</span>pretrained_model_path<span class="token punctuation">,</span> use_4bit<span class="token punctuation">,</span> bnb_4bit_compute_dtype<span class="token punctuation">,</span> bnb_4bit_quant_type<span class="token punctuation">,</span> use_nested_quant<span class="token punctuation">,</span> device_map<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Resolve the dtype name (e.g. "float16") to the torch dtype object</span>
    compute_dtype <span class="token operator">=</span> <span class="token builtin">getattr</span><span class="token punctuation">(</span>torch<span class="token punctuation">,</span> bnb_4bit_compute_dtype<span class="token punctuation">)</span>
    bnb_config <span class="token operator">=</span> BitsAndBytesConfig<span class="token punctuation">(</span>
        load_in_4bit<span class="token operator">=</span>use_4bit<span class="token punctuation">,</span>
        bnb_4bit_quant_type<span class="token operator">=</span>bnb_4bit_quant_type<span class="token punctuation">,</span>
        bnb_4bit_compute_dtype<span class="token operator">=</span>compute_dtype<span class="token punctuation">,</span>
        bnb_4bit_use_double_quant<span class="token operator">=</span>use_nested_quant<span class="token punctuation">,</span>
    <span class="token punctuation">)</span>
    model <span class="token operator">=</span> AutoModelForCausalLM<span class="token punctuation">.</span>from_pretrained<span class="token punctuation">(</span>
        pretrained_model_name_or_path<span class="token operator">=</span>pretrained_model_path<span class="token punctuation">,</span>
        quantization_config<span class="token operator">=</span>bnb_config<span class="token punctuation">,</span>
        device_map<span class="token operator">=</span>device_map
    <span class="token punctuation">)</span>
    model<span class="token punctuation">.</span>config<span class="token punctuation">.</span>use_cache <span class="token operator">=</span> <span class="token boolean">False</span>
    model<span class="token punctuation">.</span>config<span class="token punctuation">.</span>pretraining_tp <span class="token operator">=</span> <span class="token number">1</span>
    tokenizer <span class="token operator">=</span> AutoTokenizer<span class="token punctuation">.</span>from_pretrained<span class="token punctuation">(</span>pretrained_model_name_or_path<span class="token operator">=</span>pretrained_model_path<span class="token punctuation">,</span> padding_side<span class="token operator">=</span><span class="token string">"right"</span><span class="token punctuation">,</span> trust_remote_code<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
    <span class="token comment"># Reuse the end-of-sequence token for padding</span>
    tokenizer<span class="token punctuation">.</span>pad_token <span class="token operator">=</span> tokenizer<span class="token punctuation">.</span>eos_token
    <span class="token keyword">return</span> model<span class="token punctuation">,</span> tokenizer</code></pre></div> <p>Then we leveraged the <a href="https://huggingface.co/docs/trl/index" target="_blank" rel="nofollow noopener noreferrer">TRL</a> library’s Supervised Fine-tuning Trainer (SFTTrainer) to efficiently adapt the model to our target domain. The SFTTrainer provides a simple API for supervised fine-tuning on text data:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> os

<span class="token comment"># DVCLive's Hugging Face callback (the import path may differ across dvclive versions)</span>
<span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>huggingface <span class="token keyword">import</span> DVCLiveCallback
<span class="token keyword">from</span> trl <span class="token keyword">import</span> SFTTrainer

<span class="token comment"># CheckpointCallback and cleanup_incomplete_checkpoints are helpers not shown in this excerpt</span>


<span class="token keyword">def</span> <span class="token function">train_model</span><span class="token punctuation">(</span>model<span class="token punctuation">,</span> train_dataset<span class="token punctuation">,</span> valid_dataset<span class="token punctuation">,</span> lora_config<span class="token punctuation">,</span> tokenizer<span class="token punctuation">,</span> training_args<span class="token punctuation">,</span> model_adapter_out_path<span class="token punctuation">)</span><span class="token punctuation">:</span>
    trainer <span class="token operator">=</span> SFTTrainer<span class="token punctuation">(</span>
        model<span class="token operator">=</span>model<span class="token punctuation">,</span>
        train_dataset<span class="token operator">=</span>train_dataset<span class="token punctuation">,</span>
        eval_dataset<span class="token operator">=</span>valid_dataset<span class="token punctuation">,</span>
        peft_config<span class="token operator">=</span>lora_config<span class="token punctuation">,</span>
        dataset_text_field<span class="token operator">=</span><span class="token string">"text"</span><span class="token punctuation">,</span>
        tokenizer<span class="token operator">=</span>tokenizer<span class="token punctuation">,</span>
        args<span class="token operator">=</span>training_args<span class="token punctuation">,</span>
    <span class="token punctuation">)</span>
    cleanup_incomplete_checkpoints<span class="token punctuation">(</span>training_args<span class="token punctuation">.</span>output_dir<span class="token punctuation">)</span>
    trainer<span class="token punctuation">.</span>add_callback<span class="token punctuation">(</span>CheckpointCallback<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    trainer<span class="token punctuation">.</span>add_callback<span class="token punctuation">(</span>DVCLiveCallback<span class="token punctuation">(</span>log_model<span class="token operator">=</span><span class="token string">"all"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token comment"># Start fresh if the output directory is empty, otherwise resume</span>
    <span class="token keyword">if</span> <span class="token keyword">not</span> os<span class="token punctuation">.</span>listdir<span class="token punctuation">(</span>training_args<span class="token punctuation">.</span>output_dir<span class="token punctuation">)</span><span class="token punctuation">:</span>
        trainer<span class="token punctuation">.</span>train<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">else</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"Resuming from checkpoint..."</span><span class="token punctuation">)</span>
        trainer<span class="token punctuation">.</span>train<span class="token punctuation">(</span>resume_from_checkpoint<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
    trainer<span class="token punctuation">.</span>model<span class="token punctuation">.</span>save_pretrained<span class="token punctuation">(</span>model_adapter_out_path<span class="token punctuation">)</span></code></pre></div> <p>The quantized model can then be 
efficiently fine-tuned on much less capable hardware while retaining almost the same level of accuracy. By leveraging the pretrained model, tokenization, and efficient training techniques, we were able to effectively customize the model for our use case with far fewer resources than training from scratch. The pieces fit together nicely to enable state-of-the-art results on a budget.</p> <h2 id="dvc-define-ml-pipeline" style="position:relative;">DVC: Define ML Pipeline<a href="#dvc-define-ml-pipeline" aria-label="dvc define ml pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Writing the code to efficiently fine-tune a large language model is only part of the story. We also need to define a reproducible pipeline that can be run multiple times with different configuration values and hyperparameters. This is where DVC comes in. Below are the stages of the pipeline defined in <a href="https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace/blob/main/dvc.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>dvc.yaml</code></a>:</p> <ul> <li><code>generate_identity_data</code>: Generates a small subset of hardcoded conversational data about the model’s identity, creators, etc. 
saved to <code>identity_subset.jsonl</code>.</li> <li><code>process_orca_data</code>: Takes a subset of the <a href="https://huggingface.co/datasets/Open-Orca/OpenOrca" target="_blank" rel="nofollow noopener noreferrer">Open Orca</a> dataset and converts it to the prompt/completion format, saving to <code>orca_processed_subset.jsonl</code>.</li> <li><code>process_platypus_data</code>: Similarly processes a subset of the <a href="https://huggingface.co/datasets/garage-bAInd/Open-Platypus" target="_blank" rel="nofollow noopener noreferrer">Open Platypus</a> dataset.</li> <li><code>data_split</code>: Splits each of the 3 processed dataset files into train/validation sets.</li> <li><code>merge_data</code>: Concatenates all the train splits and all the validation splits into final <code>train.jsonl</code> and <code>val.jsonl</code>.</li> <li><code>train</code>: Fine-tunes a Llama-2 model on the training data using the <a href="https://github.com/huggingface/peft" target="_blank" rel="nofollow noopener noreferrer">PEFT</a> library and <a href="https://huggingface.co/docs/trl/main/en/sft_trainer" target="_blank" rel="nofollow noopener noreferrer">Supervised Fine-tuning Trainer</a>. 
Saves fine-tuned model adapters.</li> <li><code>merge_model</code>: Merges the fine-tuned adapter back into the original Llama-2 model.</li> <li><code>sanity_check</code>: Runs a few prompts through the original and fine-tuned model for a quick sanity check.</li> </ul> <p><img src="https://dvc.org/2023-09-08/dvc_dag-8bdb088e1c1034e15f82cca73f3f4360.svg" alt="DVC pipeline DAG"></p> <p>The <a href="https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace/blob/main/params.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>params.yaml</code></a> file contains the project’s configuration values and training hyperparameters.</p> <p>You can try a larger model by changing the <a href="https://github.com/alex000kim/ML-Pipeline-With-DVC-SkyPilot-HuggingFace/blob/main/params.yaml#L15" target="_blank" rel="nofollow noopener noreferrer"><code>train.model_size</code></a> parameter to <code>13b</code> (you might need to either request a larger instance or reduce the batch size to fit in GPU memory).</p> <h2 id="skypilot-run-everything-in-cloud" style="position:relative;">SkyPilot: Run everything in Cloud<a href="#skypilot-run-everything-in-cloud" aria-label="skypilot run everything in cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>You can either develop the project and run experiments interactively in the cloud inside VS Code, or submit a run job to the cloud and pull the results to your local machine.</p> <h3 id="developing-and-running-experiments-interactively-in-the-cloud" style="position:relative;">Developing and Running 
Experiments Interactively in the Cloud<a href="#developing-and-running-experiments-interactively-in-the-cloud" aria-label="developing and running experiments interactively in the cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To launch a cloud instance for interactive development, run:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ sky launch <span class="token parameter variable">-c</span> vscode <span class="token parameter variable">-i</span> <span class="token number">60</span> sky-vscode.yaml</code></pre></div> <p>This SkyPilot command will launch a <a href="https://code.visualstudio.com/docs/remote/tunnels" target="_blank" rel="nofollow noopener noreferrer">VS Code tunnel</a> to the cloud instance.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># sky-vscode.yaml</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> sky<span class="token punctuation">-</span>vscode
<span class="token key atrule">resources</span><span class="token punctuation">:</span>
  <span class="token key atrule">accelerators</span><span class="token punctuation">:</span> A10G<span class="token punctuation">:</span><span class="token number">1</span>
  <span class="token key atrule">cloud</span><span class="token punctuation">:</span> aws
  <span class="token key atrule">use_spot</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
<span class="token key atrule">workdir</span><span class="token punctuation">:</span> .
<span class="token key atrule">file_mounts</span><span class="token punctuation">:</span>
  <span class="token key atrule">~/.ssh/id_rsa</span><span class="token punctuation">:</span> ~/.ssh/id_rsa
  <span class="token key atrule">~/.ssh/id_rsa.pub</span><span class="token punctuation">:</span> ~/.ssh/id_rsa.pub
  <span class="token key atrule">~/.gitconfig</span><span class="token punctuation">:</span> ~/.gitconfig
<span class="token key atrule">setup</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
  ...
  pip install -r requirements.txt
  sudo snap install --classic code
  ...</span>
<span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
  code tunnel --accept-server-license-terms</span></code></pre></div> <p>Once the tunnel is created, you can open the VS Code instance in your browser by clicking the link in the terminal output.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a992ce025f1618d1bd2e516774f9dd4b/39600/vscode_tunnel.png" alt="VS Code Tunnel" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="submitting-experiment-jobs-to-the-cloud" style="position:relative;">Submitting Experiment Jobs to the Cloud<a href="#submitting-experiment-jobs-to-the-cloud" aria-label="submitting experiment jobs to the cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 
13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>When you are ready to launch a long-running training job, run:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ sky launch <span class="token parameter variable">-c</span> train --use-spot <span class="token parameter variable">-i</span> <span class="token number">30</span> <span class="token parameter variable">--down</span> sky-training.yaml</code></pre></div> <p>This SkyPilot command uses spot instances to save costs and automatically terminates the instance after 30 minutes of idleness. Once the experiment is complete, its artifacts such as model weights and metrics are stored in your bucket (thanks to the <a href="https://dvc.org/doc/command-reference/exp/push"><code>dvc exp push origin</code></a> command in <code>sky-training.yaml</code>).</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># sky-training.yaml</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> sky<span class="token punctuation">-</span>training
<span class="token key atrule">resources</span><span class="token punctuation">:</span>
  <span class="token key atrule">accelerators</span><span class="token punctuation">:</span> A10G<span class="token punctuation">:</span><span class="token number">1</span>
  <span class="token key atrule">cpus</span><span class="token punctuation">:</span> <span class="token number">8</span>
  <span class="token key atrule">cloud</span><span class="token punctuation">:</span> aws
  <span class="token key atrule">disk_size</span><span class="token punctuation">:</span> <span class="token number">1024</span>
<span class="token key atrule">workdir</span><span class="token punctuation">:</span> .
<span class="token key atrule">file_mounts</span><span class="token punctuation">:</span>
  <span class="token key atrule">~/.ssh/id_rsa</span><span class="token punctuation">:</span> ~/.ssh/id_rsa
  <span class="token key atrule">~/.ssh/id_rsa.pub</span><span class="token punctuation">:</span> ~/.ssh/id_rsa.pub
  <span class="token key atrule">~/.gitconfig</span><span class="token punctuation">:</span> ~/.gitconfig
<span class="token key atrule">setup</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
  pip install --upgrade pip
  pip install -r requirements.txt</span>
<span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
  dvc exp run --pull
  dvc exp push origin</span></code></pre></div> <p>While the model is training, you can monitor the logs by running the following command:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ sky logs train <span class="token punctuation">..</span>. 
<span class="token punctuation">(</span>sky-training, <span class="token assign-left variable">pid</span><span class="token operator">=</span><span class="token number">25305</span><span class="token punctuation">)</span> <span class="token number">52</span>%<span class="token operator">|</span>█████▏ <span class="token operator">|</span> <span class="token number">28</span>/54 <span class="token punctuation">[</span>00:2<span class="token operator"><span class="token file-descriptor important">0</span><</span>01:01, <span class="token number">2</span>.38s/it<span class="token punctuation">]</span> <span class="token punctuation">(</span>sky-training, <span class="token assign-left variable">pid</span><span class="token operator">=</span><span class="token number">25305</span><span class="token punctuation">)</span> <span class="token number">54</span>%<span class="token operator">|</span>█████▎ <span class="token operator">|</span> <span class="token number">29</span>/54 <span class="token punctuation">[</span>00:2<span class="token operator"><span class="token file-descriptor important">2</span><</span>00:56, <span class="token number">2</span>.28s/it<span class="token punctuation">]</span> <span class="token punctuation">(</span>sky-training, <span class="token assign-left variable">pid</span><span class="token operator">=</span><span class="token number">25305</span><span class="token punctuation">)</span> <span class="token number">56</span>%<span class="token operator">|</span>█████▌ <span class="token operator">|</span> <span class="token number">30</span>/54 <span class="token punctuation">[</span>00:2<span class="token operator"><span class="token file-descriptor important">5</span><</span>00:57, <span class="token number">2</span>.39s/it<span class="token punctuation">]</span> <span class="token punctuation">(</span>sky-training, <span class="token assign-left variable">pid</span><span class="token operator">=</span><span class="token 
number">25305</span><span class="token punctuation">)</span> <span class="token number">57</span>%<span class="token operator">|</span>█████▋ <span class="token operator">|</span> <span class="token number">31</span>/54 <span class="token punctuation">[</span>00:2<span class="token operator"><span class="token file-descriptor important">8</span><</span>01:01, <span class="token number">2</span>.67s/it<span class="token punctuation">]</span> <span class="token punctuation">..</span>.</code></pre></div> <p>Then, you can pull the results of the experiment to your local machine by running:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc exp pull origin</code></pre></div> <h3 id="customizing-the-cloud-instance-and-parameters" style="position:relative;">Customizing the Cloud Instance and Parameters<a href="#customizing-the-cloud-instance-and-parameters" aria-label="customizing the cloud instance and parameters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li> <p>You can change the cloud provider and instance type in the <code>resources</code> section of <code>sky-training.yaml</code> or <code>sky-vscode.yaml</code>.</p> </li> <li> <p>To enable <a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/live-metrics-and-plots" target="_blank" rel="nofollow noopener noreferrer">DVC Studio integration</a>, for real-time monitoring of metrics and plots, add the <code>--env DVC_STUDIO_TOKEN</code> option to the <code>sky launch</code> commands above.</p> <p><span 
class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0026972baf62460460e11e3e6d2b56d1/39600/dvc_studio.png" alt="DVC Studio integration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> </li> <li> <p>To enable <a href="https://wandb.ai/" target="_blank" rel="nofollow noopener noreferrer">Weights & Biases</a> integration, add the <code>--env WANDB_API_KEY</code> option to the <code>sky launch</code> commands above.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a2fec1735a25b18227687762225dab5d/39600/wandb.png" alt="Weights & Biases integration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> </li> </ul> <h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In this post, we walked through an end-to-end production ML pipeline for fine-tuning large language models using several key technologies:</p> <ul> <li>DVC for reproducible pipelines and efficient dataset versioning</li> <li>SkyPilot for launching cloud compute resources on demand</li> <li>HuggingFace Transformers and other libraries for efficient transformer model training</li> <li>Parameter-efficient fine-tuning and quantization techniques like PEFT and 
QLoRA for reduced precision and memory usage</li> </ul> <p>We used the everything-as-code (EaC) approach of centralizing code, datasets, hyperparameters, model weights, training infrastructure and development environment in a git repository. Even the most subtle changes to the training setup will be recorded in the git history.</p> <p>We started with a pretrained Llama-2 model and used <code>bitsandbytes</code> to quantize it for 4-bit precision. Then, we leveraged the TRL library’s Supervised Fine-tuning Trainer with PEFT for efficient domain-specific fine-tuning.</p> <p>The resulting pipeline enables state-of-the-art LLM capabilities to be customized for a target use case with modest compute requirements. DVC and SkyPilot enabled this to be built as a reproducible ML workflow using cloud resources efficiently.</p> <p>This demonstrates how proper MLOps tooling and techniques can make large language model fine-tuning achievable even with limited resources. The modular design also makes it easy to swap components like the model architecture, training method, or cloud provider.</p> <h3 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li><a href="https://huggingface.co/blog/peft" target="_blank" rel="nofollow noopener noreferrer">PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware</a></li> <li><a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes" target="_blank" rel="nofollow noopener noreferrer">Making 
LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA</a></li> <li><a href="https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications" target="_blank" rel="nofollow noopener noreferrer">Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications</a></li> <li><a href="https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html" target="_blank" rel="nofollow noopener noreferrer">Fine-Tune Your Own Llama 2 Model in a Colab Notebook</a></li> <li><a href="https://blog.skypilot.co/finetuning-llama2-operational-guide/" target="_blank" rel="nofollow noopener noreferrer">Finetuning Llama 2 in your own cloud environment, privately</a></li> </ul>https://dvc.org/blog/sagemaker-model-deploymenthttps://dvc.org/blog/sagemaker-model-deploymentWed, 30 Aug 2023 00:00:00 GMT<p>Amazon SageMaker from AWS is a popular platform for deploying Machine Learning models, showing up in almost all search results for the “best ML deployment platforms today.” So no doubt we’ve had many users ask us how they can deploy their models to SageMaker. If you would also like some help with this, you are in the right place.</p> <p>With DVC pipelines and live metrics tracking using DVCLive and DVC Studio, iterating on your Machine Learning experiments is a simple process. And DVC Model Registry makes logging, tracking and deploying your trained models equally simple. In this article, we’ll walk you through how you can create a training pipeline that saves your trained models to AWS S3, and how you can then deploy the models to different environments in SageMaker automatically!</p> <p>Interested in the final output right now? 
<a href="https://github.com/iterative/example-get-started-experiments/" target="_blank" rel="nofollow noopener noreferrer">Here’s the code</a>.</p> <h1 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>To follow along, you’ll need to provision the following resources in AWS:</p> <ul> <li>An S3 bucket for saving your models</li> <li>Credentials with write access to the above S3 bucket. You’ll need this during training to save the models.</li> <li>AWS role with <code>AmazonS3FullAccess</code> and <code>AmazonSageMakerFullAccess</code> for reading the model files and deploying them to SageMaker.</li> </ul> <h1 id="first-why-dvc--sagemaker" style="position:relative;">First, why DVC + SageMaker?<a href="#first-why-dvc--sagemaker" aria-label="first why dvc sagemaker permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>DVC provides a unified way to manage your experiments, datasets, models and code. It works on top of Git, enabling you to apply the best software engineering and DevOps practices to your Machine Learning projects. 
It is also platform-agnostic, which means you have full control over the choice of cloud services. And with a range of options for model deployment, including real-time and serverless endpoints, SageMaker is a great choice for hosting models of different sizes and inference frequencies.</p> <h1 id="prequel-dvc-push-to-save-the-models-during-training" style="position:relative;">Prequel: <code>dvc push</code> to save the models during training<a href="#prequel-dvc-push-to-save-the-models-during-training" aria-label="prequel dvc push to save the models during training permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>DVC simplifies setting up <a href="https://dvc.org/doc/user-guide/pipelines/defining-pipelines" target="_blank" rel="nofollow noopener noreferrer">reproducible pipelines</a> that automatically save your model files during model training. Each stage in a DVC pipeline represents a distinct step in the training process. For each stage, you can specify hyperparameters and other dependencies, such as datasets or outputs of previous stages. You can also specify the outputs of each stage, such as metrics, plots, models, and other files. 
Learn more <a href="https://dvc.org/doc/command-reference/stage/add" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h2 id="create-a-model-file" style="position:relative;">Create a model file<a href="#create-a-model-file" aria-label="create a model file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The <a href="https://github.com/iterative/example-get-started-experiments/blob/main/dvc.yaml#L33" target="_blank" rel="nofollow noopener noreferrer"><code>sagemaker</code> stage</a> of our pipeline creates a tar file (<code>model.tar.gz</code>) of our trained model. We then mark this tar file as an output of the stage:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc stage <span class="token function">add</span> <span class="token parameter variable">-n</span> sagemaker … <span class="token parameter variable">-o</span> model.tar.gz …</code></pre></div> <p>Note that it is not essential to create a separate <code>sagemaker</code> stage like we did. You could also create the tar file as part of <code>train</code> or any other relevant stage. 
In fact, you could even use this approach without a DVC pipeline, by simply <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>ing the model files or logging them with the <a href="https://dvc.org/doc/dvclive/live/log_artifact" target="_blank" rel="nofollow noopener noreferrer">DVCLive <code>log_artifact()</code> method</a>. But we recommend using a DVC pipeline for easy reproducibility of your ML experiments.</p> <h2 id="configure-dvc-remote" style="position:relative;">Configure DVC remote<a href="#configure-dvc-remote" aria-label="configure dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Additionally, we’ve configured the default <a href="https://dvc.org/doc/user-guide/data-management/remote-storage" target="_blank" rel="nofollow noopener noreferrer">DVC remote</a> to be our S3 bucket:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ dvc remote <span class="token function">add</span> <span class="token parameter variable">-d</span> storage s3://dvc-public/remote/get-started-pools</code></pre></div> <p>This means that whenever we run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, the updated model tar file is pushed to the S3 bucket.</p> <h2 id="run-the-pipeline-to-save-the-model-in-s3" 
style="position:relative;">Run the pipeline to save the model in S3<a href="#run-the-pipeline-to-save-the-model-in-s3" aria-label="run the pipeline to save the model in s3 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now, every time we run our training pipeline, an updated model tar file is generated, and we <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> it to the remote S3 bucket. By storing large files like the model tar file in remote storage such as S3 and keeping only small metadata files in Git, DVC makes it possible to track them in Git, maintaining Git as the single source of truth for your projects.</p> <h1 id="track-and-manage-model-versions-in-dvc-model-registry" style="position:relative;">Track and manage model versions in DVC model registry<a href="#track-and-manage-model-versions-in-dvc-model-registry" aria-label="track and manage model versions in dvc model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Our training script <a href="https://github.com/iterative/example-get-started-experiments/blob/main/src/train.py#L72" target="_blank" rel="nofollow noopener noreferrer">logs our model</a> using the <a 
href="https://dvc.org/doc/dvclive/live/log_artifact" target="_blank" rel="nofollow noopener noreferrer">DVCLive <code>log_artifact()</code> method</a>, which creates an <a href="https://github.com/iterative/example-get-started-experiments/blob/main/results/train/dvc.yaml#L8" target="_blank" rel="nofollow noopener noreferrer">artifact entry</a> of type <code>model</code> in a <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">artifacts</span><span class="token punctuation">:</span> <span class="token key atrule">pool-segmentation</span><span class="token punctuation">:</span> <span class="token key atrule">path</span><span class="token punctuation">:</span> ../../models/model.pkl <span class="token key atrule">type</span><span class="token punctuation">:</span> model <span class="token punctuation">...</span></code></pre></div> <p>Because of this, when we <a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/create-a-project#connect-to-a-git-repository-and-add-a-project" target="_blank" rel="nofollow noopener noreferrer">add the project to DVC Studio</a>, the model appears in the <a href="https://studio.datachain.ai/user/~/models" target="_blank" rel="nofollow noopener noreferrer">model registry</a>.</p> <p>Note that there are other ways to register the model in the model registry - you can <a href="https://dvc.org/doc/studio/user-guide/model-registry/add-a-model" target="_blank" rel="nofollow noopener noreferrer">add the model from the Studio UI</a> or manually add it to the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.</p> <p>Once the model is registered in the model registry, you can assign version numbers every time your ML experiment produces a model version that you like. 
Use the <a href="https://dvc.org/doc/studio/user-guide/model-registry/register-version" target="_blank" rel="nofollow noopener noreferrer"><code>Register version</code></a> option to select the Git commit for the experiment which produced the desired model version, and assign it a <a href="https://semver.org/" target="_blank" rel="nofollow noopener noreferrer">semantic version</a>. Every version registration is saved using specially formatted Git tags, which you can find in the <a href="https://github.com/iterative/example-get-started-experiments/tags" target="_blank" rel="nofollow noopener noreferrer">Git repository</a>.</p> <p><img src="https://dvc.org/2023-08-30/mr-register-version-94e709a5988cb2ef17681de3288d5803.gif" alt="Version registration in the DVC Model Registry"><em>Version registration in the DVC Model Registry</em></p> <h1 id="trigger-model-deployment-with-stage-assignments" style="position:relative;">Trigger model deployment with stage assignments<a href="#trigger-model-deployment-with-stage-assignments" aria-label="trigger model deployment with stage assignments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>So far, you have saved your model versions in your Git repository (as Git tags) and the actual model tar files in S3. Suppose you just registered version <code>1.0.0</code> of your model, and would like to deploy it to your <code>dev</code> environment so that you and your team can evaluate its performance. 
The model registry simplifies this too, by providing a mechanism to assign stages to model versions and creating specially formatted Git tags representing this action.</p> <p><img src="https://dvc.org/2023-08-30/mr-assign-stage-0bf0129f7d342729c08aa42acd95e1b2.gif" alt="Stage assignment in the DVC Model Registry"><em>Stage assignment in the DVC Model Registry</em></p> <p>Since stage assignment also creates Git tags, you can write a <a href="https://github.com/iterative/example-get-started-experiments/blob/main/.github/workflows/deploy-model.yml" target="_blank" rel="nofollow noopener noreferrer">CI/CD action that runs on Git tag push</a>.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token key atrule">push</span><span class="token punctuation">:</span> <span class="token key atrule">tags</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token string">'results/train=pool-segmentation#*'</span></code></pre></div> <p>This action parses the Git tags to determine the model, version and stage. 
DVC model registry internally uses <a href="https://mlem.ai/doc/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> to save version registrations and stage assignments, and the <a href="https://github.com/iterative/gto-action" target="_blank" rel="nofollow noopener noreferrer">Iterative GTO action</a> can be used in your <a href="https://github.com/iterative/example-get-started-experiments/blob/main/.github/workflows/deploy-model.yml" target="_blank" rel="nofollow noopener noreferrer">GitHub actions workflow</a> to parse the Git tags:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/gto<span class="token punctuation">-</span>action@v2</code></pre></div> <p>This action produces the outputs shown below:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">outputs</span><span class="token punctuation">:</span> <span class="token key atrule">event</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> steps.gto.outputs.event <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token comment"># whether the event is a version registration or a stage assignment</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> steps.gto.outputs.name <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token comment"># model name</span> <span class="token key atrule">stage</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> steps.gto.outputs.stage <span class="token punctuation">}</span><span 
class="token punctuation">}</span> <span class="token key atrule">version</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> steps.gto.outputs.version <span class="token punctuation">}</span><span class="token punctuation">}</span></code></pre></div> <p>This action is only available on GitHub, though; if you’re using GitLab, Bitbucket or some other provider, you can use the <a href="https://dvc.org/doc/gto/command-reference/check-ref"><code>gto check-ref</code></a> command to parse the Git tags, which follow <a href="https://mlem.ai/doc/gto/user-guide#git-tags-format" target="_blank" rel="nofollow noopener noreferrer">this format</a>.</p> <p>Now, whenever you <a href="https://dvc.org/doc/studio/user-guide/model-registry/assign-stage" target="_blank" rel="nofollow noopener noreferrer"><code>Assign stage</code></a> to a model version, your CI/CD action understands which version of which model was assigned which stage. Then, it can use the <a href="https://dvc.org/doc/command-reference/get#-url"><code>dvc get --show-url</code></a> command to determine the S3 path of the tar file for the model version.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">MODEL_DATA=$(dvc get <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>url . 
model.tar.gz)</code></pre></div> <p>Finally, it can invoke the <a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py" target="_blank" rel="nofollow noopener noreferrer">deployment script</a> with appropriate inputs.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">python sagemaker/deploy_model.py <span class="token punctuation">\</span> <span class="token parameter variable">--name</span> <span class="token variable">${{ needs.parse.outputs.name }</span><span class="token punctuation">}</span> <span class="token punctuation">\</span> <span class="token parameter variable">--stage</span> <span class="token variable">${{ needs.parse.outputs.stage }</span><span class="token punctuation">}</span> <span class="token punctuation">\</span> <span class="token parameter variable">--version</span> <span class="token variable">${{ needs.parse.outputs.version }</span><span class="token punctuation">}</span> <span class="token punctuation">\</span> <span class="token parameter variable">--model_data</span> <span class="token variable">$MODEL_DATA</span> <span class="token punctuation">\</span> <span class="token parameter variable">--role</span> <span class="token variable">${{ secrets.AWS_ROLE_TO_ASSUME }</span><span class="token punctuation">}</span></code></pre></div> <p>This automates the model deployment process, which is very helpful if your model is expected to evolve constantly.</p> <p>Next, we will explain the <a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py" target="_blank" rel="nofollow noopener noreferrer">deployment script</a>.</p> <h1 id="deploy-the-model-to-sagemaker-and-run-inference" style="position:relative;">Deploy the model to SageMaker and run inference<a href="#deploy-the-model-to-sagemaker-and-run-inference" aria-label="deploy the model to sagemaker and run inference permalink" 
class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>So far, you’ve seen how you can</p> <p>✅ create and run reproducible pipelines that save the model to S3,</p> <p>✅ track and manage model versions in a web model registry, and</p> <p>✅ assign stages to trigger model deployment.</p> <p>The last step above specifies which model version should be deployed to which environment. Now let’s see how to actually</p> <p>🔲 deploy the model, and</p> <p>🔲 run inference on it.</p> <p>A deployment in SageMaker is called an endpoint. When you deploy your model, you create or update an endpoint. And for running inference, you invoke the endpoint.</p> <p><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html" target="_blank" rel="nofollow noopener noreferrer">There are a few different ways to do the actual deployment</a>, including the <a href="https://sagemaker.readthedocs.io/en/stable/overview.html" target="_blank" rel="nofollow noopener noreferrer">SageMaker Python SDK</a> and the <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html" target="_blank" rel="nofollow noopener noreferrer">boto3 library</a>. 
We have chosen to use the SageMaker Python SDK, which has a two-step process for deployment:</p> <ul> <li>create the SageMaker model bundle (<a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py#L38" target="_blank" rel="nofollow noopener noreferrer">click to see the code</a>), and</li> <li>create the endpoint (<a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py#L54" target="_blank" rel="nofollow noopener noreferrer">click to see the code</a>).</li> </ul> <p>Note that if you do not expect your model to be constantly used for inference, you can create a serverless inference endpoint by specifying a <a href="https://github.com/iterative/example-get-started-experiments/blob/main/sagemaker/deploy_model.py#L58" target="_blank" rel="nofollow noopener noreferrer">serverless inference config</a> (learn about the <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html#deploy-model-options" target="_blank" rel="nofollow noopener noreferrer">different inference options</a>).</p> <p>Once deployed, the endpoint status becomes <code>InService</code> in the AWS console.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/dca163debc81f0270e78249320449300/39600/aws-sagemaker-endpoints.png" alt="InService SageMaker Endpoint" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>InService SageMaker Endpoint in the AWS console</em></p> <h2 id="run-inference" style="position:relative;">Run inference<a href="#run-inference" aria-label="run inference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now that your SageMaker deployment is ready, you can run inference using the <a href="https://github.com/iterative/example-get-started-experiments/blob/main/src/endpoint_prediction.py#L35" target="_blank" rel="nofollow noopener noreferrer">SageMaker predictor</a> (for boto3, use <a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html" target="_blank" rel="nofollow noopener noreferrer"><code>invoke_endpoint()</code></a>). <a href="https://github.com/iterative/example-get-started-experiments/blob/main/src/endpoint_prediction.py" target="_blank" rel="nofollow noopener noreferrer">Here is an inference script</a> that pre-processes your input, invokes the endpoint, applies the resulting mask to the input image to create the output image, and saves the result.</p> <p>Run this script with the following command:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ python src/endpoint_prediction.py <span class="token punctuation">\</span> <span class="token parameter variable">--img</span> <span class="token operator"><</span>jpg-file-path<span class="token operator">></span> <span class="token punctuation">\</span> <span class="token parameter variable">--endpoint_name</span> <span class="token operator"><</span>endpoint-name<span class="token operator">></span> <span class="token punctuation">\</span> <span class="token parameter variable">--output_path</span> <span class="token operator"><</span>output-folder<span class="token operator">></span></code></pre></div> <p>Here's my input image: <span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 
355px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cde059626f1aaf9ad7f2bfa14c0d9a8c/03346/input-image.jpg" alt="Input image" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>And the output identifying the swimming pools: <span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 355px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/83fdee51e0a59d67a564fef03e05d512/39600/output-image.png" alt="Output image" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h1 id="now-your-turn" style="position:relative;">Now, your turn!<a href="#now-your-turn" aria-label="now your turn permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Let us know (reach out in <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>) if you run into any issues when trying to deploy your own model to SageMaker. 
We will be more than happy to help you figure it out!</p>https://dvc.org/blog/dvc-3.0-ml-experiments-data-versioninghttps://dvc.org/blog/dvc-3.0-ml-experiments-data-versioningWed, 14 Jun 2023 00:00:00 GMT<p><a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">DVC 3.0</a> helps you <a href="#experiment-tracking-and-beyond">experiment</a>, from notebook exploration to model management, and works smarter with your <a href="#smarter-cloudremote-storage">cloud/remote storage</a> to make data versioning painless.</p> <h2 id="experiment-tracking-and-beyond" style="position:relative;">Experiment Tracking and Beyond<a href="#experiment-tracking-and-beyond" aria-label="experiment tracking and beyond permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In <a href="https://iterative.ai/blog/dvc-2-0-release" target="_blank" rel="nofollow noopener noreferrer">DVC 2.0</a>, we first released DVC experiments, providing a way to track experiments as hidden, <a href="https://iterative.ai/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">lightweight Git commits</a>, so you don't have to separately manage your experiments and code. Now it's easier to <a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">start tracking experiments</a> from your Python script or notebook (see examples). You only need a Git repo and DVC's Python logging library <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>. 
You don't need prior DVC knowledge or an existing DVC project.</p> <toggle> <tab title="Pytorch Lightning"> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>lightning <span class="token keyword">import</span> DVCLiveLogger <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> trainer <span class="token operator">=</span> Trainer<span class="token punctuation">(</span>logger<span class="token operator">=</span>DVCLiveLogger<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">)</span> trainer<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>model<span class="token punctuation">)</span></code></pre></div> </tab> <tab title="Hugging Face"> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>huggingface <span class="token keyword">import</span> DVCLiveCallback <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> trainer<span class="token punctuation">.</span>add_callback<span class="token punctuation">(</span>DVCLiveCallback<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">)</span> trainer<span class="token punctuation">.</span>train<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> </tab> <tab title="Keras"> <div class="gatsby-highlight" data-language="python"><pre 
class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>keras <span class="token keyword">import</span> DVCLiveCallback <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> model<span class="token punctuation">.</span>fit<span class="token punctuation">(</span> train_dataset<span class="token punctuation">,</span> validation_data<span class="token operator">=</span>validation_dataset<span class="token punctuation">,</span> callbacks<span class="token operator">=</span><span class="token punctuation">[</span>DVCLiveCallback<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div> </tab> <tab title="General Python API"> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live <span class="token keyword">with</span> Live<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span> <span class="token keyword">as</span> live<span class="token punctuation">:</span> live<span class="token punctuation">.</span>log_param<span class="token punctuation">(</span><span class="token string">"epochs"</span><span class="token punctuation">,</span> NUM_EPOCHS<span class="token punctuation">)</span> <span class="token keyword">for</span> epoch <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>NUM_EPOCHS<span class="token punctuation">)</span><span class="token punctuation">:</span> train_model<span 
class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span> metrics <span class="token operator">=</span> evaluate_model<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span> <span class="token keyword">for</span> metric_name<span class="token punctuation">,</span> value <span class="token keyword">in</span> metrics<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span> live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span>metric_name<span class="token punctuation">,</span> value<span class="token punctuation">)</span> live<span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> </tab> </toggle> <p>With the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension for VS Code</a>, you get an experiment tracking workbench without any servers or logins. Your experiments are also available in our collaboration hub <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Studio</a> and connected to your Git repo automatically, so you can share, review and merge like you would with code. 
You can work locally when you want and use Studio to share if and when it suits you, just like in Git.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/u-URI5Lvc-g?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="model-management" style="position:relative;">Model Management<a href="#model-management" aria-label="model management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>With the <a href="https://dvc.org/doc/studio/user-guide/model-registry" target="_blank" rel="nofollow noopener noreferrer">Studio Model Registry</a>, you can use DVC to manage your entire model lifecycle inside your Git workflow, from creating the model to deploying it in any deployment system. 
Our ethos for model management is consistent with everything else we do: it's all about integrating with your existing stack and tools, and empowering you to build your workflows around GitOps principles and automation.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/wX0KBg8EU5Y?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="cloud-experiments-alpha-release" style="position:relative;">Cloud Experiments (Alpha Release)<a href="#cloud-experiments-alpha-release" aria-label="cloud experiments alpha release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>When we released DVC 2.0, we also launched the <a href="https://iterative.ai/blog/dvc-2-0-release#new-method-to-provision-cloud-compute-in-new-cml-release" target="_blank" rel="nofollow noopener noreferrer"><code>cml runner</code></a> command to run continuous integration (CI) on your own cloud instances so you could automate large ML jobs. Cloud experiments build on this technology without CI, meaning less setup (you can configure them directly in Studio). 
With the alpha release of <a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/run-experiments#cloud-experiments" target="_blank" rel="nofollow noopener noreferrer">Studio Cloud Experiments</a>, you can run DVC experiments on your own cloud infrastructure in a few clicks, including with GPU and spot instance support.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/MF5k-qLUiAg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="hyperparameter-optimization" style="position:relative;">Hyperparameter Optimization<a href="#hyperparameter-optimization" aria-label="hyperparameter optimization permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC can also help you do hyperparameter optimization by integrating with other tools. 
You can <a href="https://dvc.org/doc/user-guide/experiment-management/running-experiments" target="_blank" rel="nofollow noopener noreferrer">queue</a> an entire grid search of experiments, configure multiple complex model architectures with <a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">Hydra</a> integration, and track your <a href="https://dvc.org/doc/dvclive/ml-frameworks/optuna" target="_blank" rel="nofollow noopener noreferrer">Optuna</a> studies.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/EpzUqvtvZ4c?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="smarter-cloudremote-storage" style="position:relative;">Smarter Cloud/Remote Storage<a href="#smarter-cloudremote-storage" aria-label="smarter cloudremote storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We are committed to building the best data versioning experience. This means making DVC work with your existing data stack and not trying to replace it. 
We have focused on working more closely with cloud storage (and non-cloud storage) by making DVC not only faster but smarter.</p> <h3 id="minimizing-downloads" style="position:relative;">Minimizing Downloads<a href="#minimizing-downloads" aria-label="minimizing downloads permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Avoiding unnecessary downloads saves time and disk space in a way that transfer speedups alone never could. You can now <a href="https://dvc.org/doc/user-guide/data-management/modifying-large-datasets" target="_blank" rel="nofollow noopener noreferrer">add or modify</a> individual files in a larger dataset. If you have a large dataset in remote storage, you can pull and modify any file without needing to download the full dataset.</p> <p><img src="https://dvc.org/2023-06-14/dvc-part-update-f045f6d718b6d1a25267598a06ef0558.gif" alt="partial-add" title="Add or modify files in a dataset."></p> <p>You can also run or verify a pipeline <a href="https://dvc.org/doc/user-guide/pipelines/running-pipelines#pull-missing-data" target="_blank" rel="nofollow noopener noreferrer">without pulling data</a> first. 
You can skip downloading data for stages that haven't changed and automatically download only the data needed for stages that have changed.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/CuorzMAUbgU?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="cloud-versioning" style="position:relative;">Cloud Versioning<a href="#cloud-versioning" aria-label="cloud versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You shouldn't have to create extra copies of data that's already backed up and versioned on the cloud. DVC <a href="https://dvc.org/doc/user-guide/data-management/cloud-versioning" target="_blank" rel="nofollow noopener noreferrer">cloud versioning</a> enables you to import data that's already versioned by your cloud provider. In the example below, DVC knows not to push any data to its own storage because it is already versioned by the cloud. 
Pulling the data later will recover it from its original source location.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import-url</span> <span class="token parameter variable">--version-aware</span> s3://mybucket/data
</span>Importing 's3://mybucket/data' -> 'data'
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span>
</span>Everything is up to date.</code></pre></div> <h3 id="pythonic-api" style="position:relative;">Pythonic API<a href="#pythonic-api" aria-label="pythonic api permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You may need to work with your cloud data outside of the command-line workflow of pushing and pulling. The <a href="https://dvc.org/doc/api-reference/dvcfilesystem" target="_blank" rel="nofollow noopener noreferrer">DVCFileSystem</a> API enables you to read and manage files and directories from remote DVC repos like you would for a local filesystem. 
In the example below, each file in the <code>data/prepared</code> directory is streamed in as text.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token operator">>></span><span class="token operator">></span> <span class="token keyword">from</span> dvc<span class="token punctuation">.</span>api <span class="token keyword">import</span> DVCFileSystem
<span class="token operator">>></span><span class="token operator">></span> repo <span class="token operator">=</span> <span class="token string">"https://github.com/iterative/example-get-started.git"</span>
<span class="token operator">>></span><span class="token operator">></span> fs <span class="token operator">=</span> DVCFileSystem<span class="token punctuation">(</span>repo<span class="token punctuation">,</span> rev<span class="token operator">=</span><span class="token string">"main"</span><span class="token punctuation">)</span>
<span class="token operator">>></span><span class="token operator">></span> <span class="token keyword">for</span> f <span class="token keyword">in</span> fs<span class="token punctuation">.</span>find<span class="token punctuation">(</span><span class="token string">"data/prepared"</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>     text <span class="token operator">=</span> fs<span class="token punctuation">.</span>read_text<span class="token punctuation">(</span>f<span class="token punctuation">)</span>
<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>     <span class="token comment"># process the data</span></code></pre></div> <h3 id="faster-performance" style="position:relative;">Faster Performance<a href="#faster-performance" aria-label="faster performance permalink" class="anchor 
after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Sometimes you just need faster performance, especially for large data downloads and uploads. We have focused on improving performance where it matters most. For example, pushing data to S3 is 2.5x faster in DVC 3.0 than in early versions of DVC 2.x according to our benchmarks.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 359px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e6987b7fd311eff2bf6da88e7cc450f0/39600/dvc-push-s3.png" alt="push-s3" title="Time to push to S3." loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h1 id="thank-you" style="position:relative;">Thank You!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Our constant interaction with the DVC community gives us feedback on what should be improved. We heard from you that the ML landscape is already complex and you want to keep your tools simple. 
That's why many of the new "features" are improvements to existing functionality, and why we are building this stack of tools to make DVC easier, more flexible, and the solid choice for your MLOps workflows.</p> <p>Finally, none of these improvements would be possible without the support of the teams who work on the entire DVC stack.</p> <p>Thanks to all of you who make DVC and its community what it is!</p> <h1 id="get-started-with-the-dvc-30-stack" style="position:relative;">Get Started with the DVC 3.0 Stack<a href="#get-started-with-the-dvc-30-stack" aria-label="get started with the dvc 30 stack permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Get started with DVC 3.0 or the other tools in the DVC stack:</p> <ul> <li><a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">DVC 3.0</a></li> <li><a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Studio</a></li> <li><a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension for VS Code</a></li> </ul>https://dvc.org/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-2https://dvc.org/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-2Wed, 10 May 2023 00:00:00 GMT<p>In the <a href="https://iterative.ai/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-1/" target="_blank" rel="nofollow noopener noreferrer">previous post</a>, we discussed how DVC simplifies physical simulation pipelines and data 
management. This post discusses how to run simulations in the cloud, run new experiments, and visualize simulation results with <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> and other tools.</p> <p>In this post, you will learn how to:</p> <ol> <li> <p>Manage computational resources on AWS by starting and shutting down EC2 instances for simulation experiments.</p> </li> <li> <p>Run new <a href="https://www.openfoam.com/" target="_blank" rel="nofollow noopener noreferrer">OpenFOAM</a> simulations in the cloud using Iterative Studio and <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>.</p> </li> <li> <p>Use Iterative Studio to view simulation results and DVC plots online.</p> </li> </ol> <p>This post is the result of a collaboration between the <a href="http://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative.ai</a> and <a href="https://plasmasolve.com/about-us/" target="_blank" rel="nofollow noopener noreferrer">PlasmaSolve</a> teams. PlasmaSolve was founded in 2016 by plasma physicists and software engineers to provide a platform for cutting-edge physics simulation services and research. 
The PlasmaSolve team strives to deliver top-notch solutions and well-designed physics simulations to speed up research and reduce development costs using various open-source and commercial simulation tools.</p> <h1 id="run-simulations-in-the-cloud-with-gitlab-and-cml" style="position:relative;">Run simulations in the cloud with GitLab and CML<a href="#run-simulations-in-the-cloud-with-gitlab-and-cml" aria-label="run simulations in the cloud with gitlab and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <admon type="tip"> <p>For this part of the post, we follow the <code>main</code> branch of the <a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/tree/main" target="_blank" rel="nofollow noopener noreferrer">demo repository</a>. Please follow the <a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/blob/main/README.md" target="_blank" rel="nofollow noopener noreferrer">README</a> to prepare your environment and install dependencies.</p> </admon> <p>OpenFOAM simulations can be computationally intensive, requiring access to high-performance computing resources or a cluster of computers to solve large or complex problems.</p> <p>To run the demo simulation on AWS, we can use <a href="https://cml.dev/doc" target="_blank" rel="nofollow noopener noreferrer">CML (Continuous Machine Learning)</a>. 
CML can start a new AWS EC2 instance to run a simulation experiment and shut it down when the experiment is done.</p> <p>The full configuration for the demo CI pipeline can be found in the <a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/blob/main/.gitlab-ci.yml" target="_blank" rel="nofollow noopener noreferrer"><code>.gitlab-ci.yml</code></a> file.</p> <p>The demo project shows an example of how to integrate CML into a GitLab CI configuration. The pipeline has two stages: <code>build</code> and <code>run</code>. The <code>build</code> stage has a single job that builds a Docker image based on the specified <code>Dockerfile</code> and, after logging in to Amazon Elastic Container Registry (ECR), pushes the image to the registry. The <code>run</code> stage has three jobs: <code>launch</code>, <code>run</code>, and <code>report</code>. The <code>launch</code> job launches an EC2 instance on Amazon Web Services (AWS), and the <code>run</code> job runs a simulation on that instance. The <code>report</code> job generates a report on the simulation results. 
The diagram below shows the CI pipeline and the AWS services it uses.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f11f58685a7da2c3a6fa495b06c261d2/39600/architecture.png" alt="CML with GitLab CI configuration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>CML with GitLab CI configuration</em></p> <h2 id="using-aws-computational-resources" style="position:relative;">Using AWS computational resources<a href="#using-aws-computational-resources" aria-label="using aws computational resources permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When a workflow requires computational resources (such as GPUs), CML can automatically allocate cloud instances using <a href="https://cml.dev/doc/ref/runner" target="_blank" rel="nofollow noopener noreferrer">cml runner</a>. You can spin up instances on AWS, Azure, GCP, or Kubernetes (<a href="https://cml.dev/doc/self-hosted-runners#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">see the required credentials</a>). 
Alternatively, you can connect to <a href="https://cml.dev/doc/self-hosted-runners#on-premise-local-runners" target="_blank" rel="nofollow noopener noreferrer">any other computing provider or an on-premise (local) machine</a>.</p> <p>Below is an example of the GitLab CI <code>launch</code> job configuration that allocates AWS instances using <code>cml runner</code> command. Users may define the region, instance type, and storage size that are needed:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">launch</span><span class="token punctuation">:</span> <span class="token key atrule">stage</span><span class="token punctuation">:</span> run <span class="token key atrule">rules</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">changes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>dvc.yaml<span class="token punctuation">,</span> params.yaml<span class="token punctuation">,</span> .gitlab<span class="token punctuation">-</span>ci.yml<span class="token punctuation">]</span> <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1 <span class="token key atrule">script</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token scalar string"> cml runner launch --cloud=aws --cloud-region=$AWS_DEFAULT_REGION --cloud-type=m5.2xlarge --cloud-hdd-size=32 --labels=cml --docker-volumes="/home/.cml/cache:/home/.cml/cache"</span></code></pre></div> <h2 id="setup-ci-jobs-to-run-a-simulation" style="position:relative;">Setup CI jobs to run a simulation<a href="#setup-ci-jobs-to-run-a-simulation" aria-label="setup ci jobs to run a simulation permalink" class="anchor 
after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To run a new simulation experiment on the instance started by <code>cml runner</code>, we need to specify the <code>cml</code> tag in the <code>run</code> job and call the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token key atrule">stage</span><span class="token punctuation">:</span> run <span class="token key atrule">tags</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>cml<span class="token punctuation">]</span> <span class="token key atrule">rules</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">changes</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>params.yaml<span class="token punctuation">,</span> .gitlab<span class="token punctuation">-</span>ci.yml<span class="token punctuation">]</span> <span class="token key atrule">image</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span>AWS_CONTAINER_IMAGE<span class="token punctuation">}</span> <span class="token key atrule">script</span><span class="token punctuation">:</span> <span class="token punctuation">...</span> <span class="token comment"># Run an experiment</span> <span class="token punctuation">-</span> dvc pull <span class="token punctuation">|</span><span
class="token punctuation">|</span> echo "Pull failed" <span class="token comment"># Pull outputs of previous simulation if any</span> <span class="token punctuation">-</span> dvc exp run <span class="token punctuation">-</span>f <span class="token punctuation">-</span> dvc push <span class="token comment"># Save results</span> <span class="token punctuation">-</span> rsync <span class="token punctuation">-</span>r ./ /home/.cml/cache/run <span class="token comment"># Share results with 'report' job</span></code></pre></div> <p>The <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> command downloads the results of previous experiments from remote storage. By comparing the versions of those results with the dependencies of each pipeline stage, DVC can skip stages that do not need to be rerun, which saves time and computational resources. After the simulation completes, <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> uploads the new results to remote storage.</p> <p>After the <code>run</code> job completes, the <code>report</code> job prepares and publishes the CML report to the associated Git commit. 
For this, we need to build a <code>report.md</code> file with all text and plots in Markdown format, publish it with the <code>cml comment create</code> command, and open a pull request with <code>cml pr create</code>.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">report</span><span class="token punctuation">:</span> <span class="token punctuation">...</span> <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1 <span class="token comment"># Python, DVC, & CML pre-installed</span> <span class="token key atrule">script</span><span class="token punctuation">:</span> <span class="token punctuation">...</span> <span class="token comment"># Create CML report</span> <span class="token punctuation">-</span> <span class="token punctuation">|</span><span class="token scalar string"> cat <<EOF > report.md ... 
![](sonicFoam/postProcessing/float_pressure.png) EOF</span> <span class="token punctuation">-</span> cml comment create <span class="token punctuation">-</span><span class="token punctuation">-</span>publish<span class="token punctuation">-</span>native report.md <span class="token punctuation">-</span> cml pr create .</code></pre></div> <p>These reports can help teammates collaborate on simulation results through a Git workflow.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/20456cd9f4b2b89b286c5eda678f615b/39600/git_report.png" alt="A report posted after the simulation runs in the pull request" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>A report posted in the pull request after the simulation runs</em></p> <h2 id="setup-gitlab-ci-variables" style="position:relative;">Set up GitLab CI variables<a href="#setup-gitlab-ci-variables" aria-label="setup gitlab ci variables permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To run simulations on AWS with GitLab CI and CML, we recommend starting from provider-managed policies and roles, then restricting the permissions further where possible. 
<a href="https://cml.dev/doc/ref/runner?tab=AWS#common-permissions" target="_blank" rel="nofollow noopener noreferrer">Here is a set of common permissions required by CML</a>.</p> <p>In this demo, we used the following CI variables in the project <code>Settings → CI/CD → Variables</code>:</p> <ul> <li><code>AWS_ACCESS_KEY_ID</code></li> <li><code>AWS_SECRET_ACCESS_KEY</code></li> <li><code>AWS_SESSION_TOKEN</code> - optional; whether it is needed depends on the AWS organization settings.</li> <li><code>REPO_TOKEN</code> - a <a href="https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html" target="_blank" rel="nofollow noopener noreferrer">personal access token</a> with the <code>api</code>, <code>read_repository</code> and <code>write_repository</code> scopes. Find more details in <a href="https://cml.dev/doc/self-hosted-runners?tab=GitLab#personal-access-token" target="_blank" rel="nofollow noopener noreferrer">CML docs on Personal Access Token</a>.</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7fcdd3ce246a70a4c96c8a5f6685807e/39600/ci_vars.png" alt="Examples of CI variables in GitLab" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Examples of CI variables in GitLab</em></p> <admon type="tip"> <p>Note: AWS_SESSION_TOKEN is not required for most users. In this demo, it’s specific to Iterative's sandbox account.</p> </admon>
<h1 id="experimenting-and-visualization-simulation-results-in-iterative-studio" style="position:relative;">Experimenting with and visualizing simulation results in Iterative Studio<a href="#experimenting-and-visualization-simulation-results-in-iterative-studio" aria-label="experimenting and visualization simulation results in iterative studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> is a web application that you can access online or even host on-prem. 
Built on the leading open-source tools <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>, and <a href="https://git-scm.com/" target="_blank" rel="nofollow noopener noreferrer">Git</a>, it enables you to manage data, run and track experiments, and visualize and share results.</p> <h2 id="run-a-new-simulation" style="position:relative;">Run a new simulation<a href="#run-a-new-simulation" aria-label="run a new simulation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Using Iterative Studio, we can run new simulation experiments in the cloud and visualize the results in the Studio UI.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2023-05-10/studio-run-new-simulation-5c1a866efc1c8591b67c3b2463c940e2.mp4" type="video/mp4"> Your browser does not support the video tag. 
</video><em>Example of running a new simulation experiment via Iterative Studio</em></p> <h2 id="visualize-simulation-results" style="position:relative;">Visualize simulation results<a href="#visualize-simulation-results" aria-label="visualize simulation results permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Iterative Studio helps to visualize simulation result images and DVC plots just after the simulation is complete. Studio allows one to plot images and metrics, and compare them with previous simulations.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2023-05-10/studio-visualize-simulation-results-5c1a866efc1c8591b67c3b2463c940e2.mp4" type="video/mp4"> Your browser does not support the video tag. 
</video><em>Example of visualizing simulation results in Iterative Studio</em></p> <h1 id="visualize-the-simulation-outputs-with-paraview" style="position:relative;">Visualize the simulation outputs with ParaView<a href="#visualize-the-simulation-outputs-with-paraview" aria-label="visualize the simulation outputs with paraview permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>OpenFOAM supports several tools for visualizing simulation results, including ParaView, a popular open-source visualization tool. With these tools, users can generate plots, contour plots, and volume renderings of simulation results.</p> <p>DVC can download the simulation outputs so that they can be visualized locally. A single command fetches all the data generated by the simulation:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp pull</span></span></code></pre></div> <p>Downloaded data can be visualized with third-party tools like ParaView.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2023-05-10/ParaView_sonicFoam-388acefd9308949032f41976a4862e26.mp4" type="video/mp4"> Your browser does not support the video tag. 
</video> <em>Example for sonicFoam simulation results visualized in ParaView</em></p> <h1 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>This post details how Iterative tools help in physical and computational simulations. For this purpose, we created a demo project built with OpenFOAM. The demo shows how to set up DVC for simulation experiments and data management. CML is used in the GitLab CI pipeline to manage computational resources on AWS. Iterative Studio is then used as a UI to visualize simulation results and run new simulations in a few clicks.</p> <p>Overall, DVC, CML, and Iterative Studio can help OpenFOAM users:</p> <ol> <li> <p>Reduce the complexity of simulation pipelines and automate tasks such as running simulations, post-processing results, and generating reports.</p> </li> <li> <p>Manage and track the data and code associated with your OpenFOAM simulations, and make it easier to reproduce simulation results. 
Store simulation data on-premises or in the cloud using a variety of storage types, such as S3.</p> </li> <li> <p>Manage simulation experiments with simple YAML config files.</p> </li> <li> <p>Manage computational resources on AWS and start and shut down EC2 instances for simulation experiments.</p> </li> <li> <p>Iterative Studio provides a user-friendly interface for simulation results, visualization, and running new simulations quickly.</p> </li> <li> <p>Iterative Studio allows users to view and share simulation results and DVC plots online, without the need to download and visualize results locally.</p> </li> </ol> <h1 id="references" style="position:relative;">References<a href="#references" aria-label="references permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <ul> <li><a href="https://www.simscale.com/blog/openfoam-users-should-try-simscale/" target="_blank" rel="nofollow noopener noreferrer">Why OpenFOAM Users Should Try SimScale</a></li> <li><a href="https://www.openfoam.com/documentation/tutorial-guide/3-compressible-flow/3.2-supersonic-flow-over-a-forward-facing-step" target="_blank" rel="nofollow noopener noreferrer">OpenFoam - Tutorial Guide: Supersonic flow over a forward-facing step</a></li> <li><a href="https://openfoamwiki.net/index.php/ScalarTransportFoam" target="_blank" rel="nofollow noopener noreferrer">Introduction to ScalarTransportFoam solver on OpenFoamWiki</a></li> <li><a href="https://develop.openfoam.com/Development/openfoam/-/tree/master/tutorials/basic/scalarTransportFoam" target="_blank" rel="nofollow noopener 
noreferrer"><code>scalarTransportFoam</code> Tutorial</a></li> <li><a href="https://www.researchgate.net/profile/Ingo-Riess/post/How_to_model_smoke_propagation_for_an_existing_velocity_field_using_scalarTransportFoam_in_OpenFOAM/attachment/5cee6f723843b0b98254daac/AS%3A763860613099524%401559129970722/download/5-scalarTransportFoamTutorial.pdf" target="_blank" rel="nofollow noopener noreferrer">Walkthrough and tutorial for <code>scalarTransportFoam</code>: a solver for advection-diffusion of a passive scalar</a>, <em>Eric Paterson, Kevin T. Crofton Department of Aerospace and Ocean Engineering, Virginia Polytechnic Institute and State University</em></li> </ul>https://dvc.org/blog/testing-external-contributions-using-github-actions-secretshttps://dvc.org/blog/testing-external-contributions-using-github-actions-secretsThu, 20 Apr 2023 00:00:00 GMT<p>As cloud-native applications become more complex and rely on more third-party services, testing becomes increasingly difficult. One of the most significant challenges for open source projects is testing contributions against complex services that require authentication and are particularly hard to mock.</p> <p>In this blog post, we will explore a simple method for securely running this kind of integration test on external pull requests, using the GitHub Actions <a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_target" target="_blank" rel="nofollow noopener noreferrer"><code>pull_request_target</code> trigger</a> and GitHub <a href="https://docs.github.com/en/actions/deployment/targeting-different-environments" target="_blank" rel="nofollow noopener noreferrer">environments</a> to prevent unauthorized runs.</p> <h2 id="configuration" style="position:relative;">Configuration<a href="#configuration" aria-label="configuration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4
3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ol> <li> <p><a href="https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository" target="_blank" rel="nofollow noopener noreferrer">Create some encrypted secrets</a>; a secret named <strong><code>EXAMPLE</code></strong> will be used to illustrate the next sections.</p> </li> <li> <p><a href="https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment#creating-an-environment" target="_blank" rel="nofollow noopener noreferrer">Create an environment</a> named <code>external</code> and add some trusted GitHub users or <a href="https://docs.github.com/en/organizations/organizing-members-into-teams/about-teams" target="_blank" rel="nofollow noopener noreferrer">teams</a> as <a href="https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment#required-reviewers" target="_blank" rel="nofollow noopener noreferrer">required reviewers</a>; they’ll be responsible for approving every run triggered by external contributors.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d3900aa107f0569453a5d2a73fb3b4d3/03346/environment.jpg" alt="screenshot of environment settings" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> </li> </ol> <h2 id="workflow" style="position:relative;">Workflow<a href="#workflow" aria-label="workflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path 
fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <blockquote> <p>⚠️ <strong>Warning</strong>: using the <a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_target" target="_blank" rel="nofollow noopener noreferrer"><code>pull_request_target</code></a> event without the cautionary measures described below may allow unauthorized GitHub users to open a “pwn request” and exfiltrate secrets; see also this [<a href="https://securitylab.github.com/research/github-actions-preventing-pwn-requests" target="_blank" rel="nofollow noopener noreferrer">1</a>, <a href="https://securitylab.github.com/research/github-actions-untrusted-input" target="_blank" rel="nofollow noopener noreferrer">2</a>, <a href="https://securitylab.github.com/research/github-actions-building-blocks" target="_blank" rel="nofollow noopener noreferrer">3</a>] blog post series from GitHub Security Lab and <a href="https://stackoverflow.com/a/71366152/4654476" target="_blank" rel="nofollow noopener noreferrer">this</a> Stack Overflow answer.</p> </blockquote> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">on</span><span class="token punctuation">:</span> pull_request_target <span class="token key atrule">jobs</span><span class="token punctuation">:</span> <span class="token key atrule">authorize</span><span class="token punctuation">:</span> <span class="token key atrule">environment</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> github.event_name == 
'pull_request_target' <span class="token important">&&</span> github.event.pull_request.head.repo.full_name <span class="token tag">!=</span> github.repository <span class="token important">&&</span> 'external' <span class="token punctuation">|</span><span class="token punctuation">|</span> 'internal' <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token boolean important">true</span> <span class="token key atrule">test</span><span class="token punctuation">:</span> <span class="token key atrule">needs</span><span class="token punctuation">:</span> authorize <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v3 <span class="token key atrule">with</span><span class="token punctuation">:</span> <span class="token key atrule">ref</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> github.event.pull_request.head.sha <span class="token punctuation">|</span><span class="token punctuation">|</span> github.ref <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token punctuation">-</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> printenv EXAMPLE <span class="token key atrule">env</span><span class="token 
punctuation">:</span> <span class="token key atrule">EXAMPLE</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.EXAMPLE <span class="token punctuation">}</span><span class="token punctuation">}</span></code></pre></div> <p>This workflow will be triggered by the <a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_target" target="_blank" rel="nofollow noopener noreferrer"><code>pull_request_target</code></a> event, which is <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#onpushpull_requestpull_request_targetpathspaths-ignore" target="_blank" rel="nofollow noopener noreferrer">similar</a> to the <a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request" target="_blank" rel="nofollow noopener noreferrer"><code>pull_request</code></a> event, but it always passes secrets to workflows triggered from fork pull requests.</p> <p>The <code>authorize</code> job checks if the workflow was triggered from a fork pull request. In that case, the <code>external</code> environment will prevent the job from running until it’s approved. Otherwise (i.e. when pull requests belong to the main repository), the job will run without requiring explicit approval.</p> <p>The <code>test</code> job is where secrets would be used. It <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idneeds" target="_blank" rel="nofollow noopener noreferrer"><code>needs</code></a> the previous job, so it will never run without explicit approval. 
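</p> <p>As a worked example of the <code>environment</code> expression in the <code>authorize</code> job: the workflow chains <code>&amp;&amp;</code> and <code>||</code>, which return one of their operands rather than a plain boolean, to pick an environment name. The annotated fragment below traces both cases by hand. It is a sketch that assumes, as the description above implies, that the <code>internal</code> environment has no protection rules configured.</p>
<div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">authorize:
  # Fork pull request: head.repo.full_name differs from github.repository
  #   true &amp;&amp; true &amp;&amp; 'external'   evaluates to 'external'
  #   'external' || 'internal'     evaluates to 'external' (waits for a required reviewer)
  # Pull request from a branch of the same repository: the names are equal
  #   true &amp;&amp; false &amp;&amp; 'external'  evaluates to false
  #   false || 'internal'          evaluates to 'internal' (assumed to have no protection rules)
  environment: ${{ github.event_name == 'pull_request_target' &amp;&amp; github.event.pull_request.head.repo.full_name != github.repository &amp;&amp; 'external' || 'internal' }}
</code></pre></div>
<p>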
The security of this approach relies on a human approving every run after making sure that the pull request contains no malicious code, hence the <code>test</code> job also overrides <a href="https://github.com/actions/checkout#checkout-a-different-branch" target="_blank" rel="nofollow noopener noreferrer">the <code>ref</code> from <code>actions/checkout</code></a> to run on the pull request branch rather than on the main branch.</p> <h2 id="alternatives" style="position:relative;">Alternatives<a href="#alternatives" aria-label="alternatives permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Admittedly, adding this <code>authorize</code> job to the workflow isn’t particularly elegant but, as of January 2023, GitHub doesn’t provide any official guidance on how to achieve a similar result in simpler ways.</p> <ul> <li>In 2020, GitHub <a href="https://github.blog/2020-08-03-github-actions-improvements-for-fork-and-pull-request-workflows/" target="_blank" rel="nofollow noopener noreferrer">introduced</a> an option to send secrets to workflows from fork pull requests, but it only affects fork pull requests from private repositories.</li> <li>In 2021, GitHub <a href="https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/" target="_blank" rel="nofollow noopener noreferrer">introduced</a> an option to <a
href="https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/enabling-features-for-your-repository/managing-github-actions-settings-for-a-repository#configuring-required-approval-for-workflows-from-public-forks" target="_blank" rel="nofollow noopener noreferrer">require approval for all outside collaborators</a>, but the <code>pull_request_target</code> event will trigger <a href="https://docs.github.com/en/enterprise-cloud@latest/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks#about-workflow-runs-from-public-forks" target="_blank" rel="nofollow noopener noreferrer">regardless of the approval settings</a>.</li> </ul> <p>Other common alternatives include skipping tests that need access to secrets, disabling forks, and using pull request labels or code review approvals to control the execution of tests.</p> <h2 id="security-testing" style="position:relative;">Security Testing<a href="#security-testing" aria-label="security testing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This approach has been tested sporadically by security researchers who found our repositories while looking for the <code>pull_request_target</code> trigger, but none of them (<a href="https://github.com/iterative/cml/pull/1130" target="_blank" rel="nofollow noopener noreferrer">#1130</a> [<a href="https://marcyoung.us/post/zuckerpunch" target="_blank" rel="nofollow noopener noreferrer">1</a>] and <a href="https://github.com/iterative/cml/pull/1322" target="_blank" rel="nofollow noopener noreferrer">#1322</a>) were able to 
bypass this protection. If you find a way to bypass it, please feel free to put <a href="https://iterative.ai/security-and-privacy/" target="_blank" rel="nofollow noopener noreferrer">our bug bounty program</a> to good use.</p> <hr> <p>There you have it! As far as we know, this is currently the most elegant GitHub Actions configuration for testing pull requests from public repository forks using secrets. As maintainers of a lot of open source software, we hold this close to our hearts!</p> <p>Here are some example usages for <a href="https://github.com/iterative/cml/blob/1be24edaa817de320a657ec3ad1182e145aecef7/.github/workflows/test-deploy.yml#L13-L20" target="_blank" rel="nofollow noopener noreferrer">cml</a> and <a href="https://github.com/iterative/mlem/blob/462384ee7a9fc50196e06942684171e9915f46ae/.github/workflows/check-test-release.yml#L13-L25" target="_blank" rel="nofollow noopener noreferrer">mlem</a>.</p> <p><em>Do you have a better alternative, or a similar use case you want to discuss? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p>https://dvc.org/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-1https://dvc.org/blog/managing-openfoam-physical-simulations-with-dvc-cml-studio-part-1Mon, 17 Apr 2023 00:00:00 GMT<h1 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><a href="https://www.openfoam.com/" target="_blank" rel="nofollow noopener noreferrer">OpenFOAM</a> is a powerful, open-source software tool used for <a href="https://en.wikipedia.org/wiki/Computational_fluid_dynamics" target="_blank" rel="nofollow noopener noreferrer">computational fluid dynamics</a> (CFD) simulations. It allows engineers and scientists to model and analyze the flow of fluids, such as gases and liquids, through intricate geometries and physical phenomena. For example, such physical phenomena could be turbulence, heat transfer, and chemical reactions. OpenFOAM has a large and dedicated user base and is utilized in a variety of industries, including aerospace, automotive, chemical, energy, and marine engineering.</p> <p>This post focuses on the following challenges that users of OpenFOAM may encounter:</p> <ol> <li> <p><strong>Complexity</strong>: OpenFOAM is a highly flexible and powerful tool, but this can also make it difficult for new users to learn and navigate. 
The software has a large number of solvers and utilities, and it can be challenging to understand which solver is most suitable for a given problem.</p> </li> <li> <p><strong>Data management:</strong> OpenFOAM simulations generate a number of outputs that need to be stored, versioned, shared, and cleaned up when needed.</p> </li> <li> <p><strong>Interfacing with other software:</strong> OpenFOAM may need to be used in conjunction with other software, such as CAD or mesh generation tools, and there can be challenges in integrating these tools and transferring data between them.</p> </li> <li> <p><strong>Software version control:</strong> OpenFOAM and other simulation software are very complex packages that are updated constantly.</p> </li> </ol> <p>All of the challenges above are harder for a small team of researchers who develop and run simulations. They may lack experience with DevOps and cloud infrastructure management, so they need a handy toolset that helps with pipelines and infrastructure setup.</p> <p>With DVC, you can version simulation outputs and pipelines, and control the software versions used to execute the pipeline, ensuring consistent results. These features let users check that a new version of the software produces the same results as previous versions, helping to maintain the reliability and accuracy of the simulations. <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a> and <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> together cover cloud resource management, running new experiments via a convenient UI, and showing the parameters and results of the simulation.</p> <p>We describe these and other features in the two following posts. In this post, we discuss how Iterative tools help with physical and computational simulations. To do this, we’ll go over a simple demo project built with OpenFOAM. 
The demo shows how to set up DVC for simulation experiments and data management.</p> <p>These posts are a result of collaboration between the <a href="http://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative.ai</a> and <a href="https://plasmasolve.com/about-us/" target="_blank" rel="nofollow noopener noreferrer">PlasmaSolve</a> teams. PlasmaSolve was founded in 2016 by plasma physicists and software engineers to provide a platform for cutting-edge physics simulation services and research. The PlasmaSolve team strives to deliver top-notch solutions and well-designed physics simulations to speed up research and reduce development costs using various open-source and commercial simulation tools.</p> <p><strong>In this post, you will learn how to:</strong></p> <ol> <li> <p>Configure and run OpenFOAM simulations with DVC</p> </li> <li> <p>Store and share simulation data in the cloud using DVC</p> </li> </ol> <h1 id="sonicfoam-simulation-pipeline" style="position:relative;"><code>sonicFoam</code> simulation pipeline<a href="#sonicfoam-simulation-pipeline" aria-label="sonicfoam simulation pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>OpenFOAM simulations may include several computational steps, from mesh generation to a large number of solvers and post-processing simulation results. SonicFoam is a simulation tool based on the open-source CFD (Computational Fluid Dynamics) software OpenFOAM. 
It is used to simulate compressible, inviscid flows with high Mach numbers, such as supersonic flows.</p> <p>In this demo, we simulate a supersonic flow over a step located at the front of the flow. The scenario involves a Mach 3 flow entering a rectangular area with a step near the inlet, which creates shock waves. We use the same geometry to run two chained simulations: <code>sonicFoam</code> and <code>scalarTransportFoam</code>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 531px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/988022d67d2d773683b934528aaf264f/39600/shock_fronts.png" alt="Shock fronts in the forward step problem" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Shock fronts in the forward step problem <a href="https://www.openfoam.com/documentation/tutorial-guide/3-compressible-flow/3.2-supersonic-flow-over-a-forward-facing-step" target="_blank" rel="nofollow noopener noreferrer">(source)</a></em></p> <p>Our demo simulation pipeline contains a few steps:</p> <ol> <li> <p>Generate geometry with <code>blockMesh</code>;</p> </li> <li> <p>Run <code>sonicFoam</code> simulation to get velocity (<code>U</code>) and temperature (<code>T</code>) fields;</p> </li> <li> <p>Post-processing simulation results;</p> </li> <li> <p>Run a subsequent <code>scalarTransportFoam</code> simulation that uses the velocity field computed before.</p> </li> </ol> <p>In reality, simulations sometimes need to be “chained”, i.e. outputs of one simulation go as an input to another simulation. When running a parametric study of such a simulation chain, intermediate simulations are often recomputed even if the parameter change does not influence them. We demonstrate how to use DVC to cache all the results and only trigger a computation if really necessary. 
Results of the <code>sonicFoam</code> solver serve as inputs to the <code>scalarTransportFoam</code> solver.</p> <p>As a basis for the demo, we use the OpenFOAM <a href="https://www.openfoam.com/documentation/tutorial-guide/3-compressible-flow/3.2-supersonic-flow-over-a-forward-facing-step" target="_blank" rel="nofollow noopener noreferrer">Supersonic flow over a forward-facing step tutorial</a>. The original code can be found <a href="https://develop.openfoam.com/Development/openfoam/tree/master/tutorials/compressible/sonicFoam/laminar/forwardStep" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="setup-the-demo-project" style="position:relative;">Set up the demo project<a href="#setup-the-demo-project" aria-label="setup the demo project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>💡 For this part of the post, we follow the <code>no-dvc</code> branch in the <a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/tree/no-dvc" target="_blank" rel="nofollow noopener noreferrer">demo repository</a>.</p> <p>The easiest way to follow the demo with the OpenFOAM simulation is to run it in <a href="https://www.docker.com/" target="_blank" rel="nofollow noopener noreferrer">Docker</a> containers. Follow the setup section in the repository <code>README</code> to build a Docker image, set up a Python virtual environment, and install dependencies.</p> <p>After the environment is set up, we only need to run the <code>openfoam-cse-docker</code> script, which runs a new OpenFOAM job in a Docker container. 
For example, to run the OpenFOAM simulation in an interactive way, use the command:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker</span></code></pre></div> <h2 id="1-generate-geometry-with-blockmesh" style="position:relative;">1. Generate geometry with <code>blockMesh</code><a href="#1-generate-geometry-with-blockmesh" aria-label="1 generate geometry with blockmesh permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To use <code>sonicFoam</code>, a user must first create a 3D geometry model of the flow domain using a tool such as CAD software. The user must then define the boundary conditions and physical properties of the flow, such as the temperature, pressure, and velocity at each boundary. 
The user can then run the simulation using the <code>sonicFoam</code> solver, which will solve the governing equations of compressible flow using the finite volume method.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && blockMesh'</span></span></code></pre></div> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 460px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8604d3e8996de7144a3e3553f04fecff/39600/forward_step_geometry.png" alt="Geometry of the forward step" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Geometry of the forward step <a href="https://www.openfoam.com/documentation/tutorial-guide/3-compressible-flow/3.2-supersonic-flow-over-a-forward-facing-step" target="_blank" rel="nofollow noopener noreferrer">(source)</a></em></p> <h2 id="2-run-the-first-step-simulation-with-sonicfoam-solver" style="position:relative;">2. 
Run the first step simulation with <code>sonicFoam</code> solver<a href="#2-run-the-first-step-simulation-with-sonicfoam-solver" aria-label="2 run the first step simulation with sonicfoam solver permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>During the simulation, <code>sonicFoam</code> will calculate various flow quantities, such as the pressure, velocity, and temperature, at each point in the flow domain. The user can then visualize and analyze these results using post-processing tools, such as ParaView, to gain insight into the flow behavior.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && sonicFoam'</span></span></code></pre></div> <h2 id="3-post-processing-simulation-results" style="position:relative;">3. 
Post-processing simulation results<a href="#3-post-processing-simulation-results" aria-label="3 post processing simulation results permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As an example of post-processing stages in the simulation demo, we have a few tasks:</p> <ul> <li> <p>calculate the magnitude of the velocity</p> </li> <li> <p>calculate <code>flowRatePatch</code></p> </li> <li> <p>generate VTK and visualize mesh</p> </li> </ul> <p><strong>Calculate the magnitude of the velocity</strong></p> <p><code>postProcess</code> is a command that allows users to perform post-processing operations on simulation data. The <code>-func</code> option specifies the function that should be applied to the data. In this case, it calculates the magnitude of the velocity field and writes it into a file named <code>mag(U)</code> in each time directory generated during simulation:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && postProcess -func "mag(U)"'</span></span></code></pre></div> <p>The <code>postProcess</code> command can be used in conjunction with various options and functions to perform a wide range of post-processing tasks, such as calculating flow quantities, generating plots, and creating animations. 
It is an important tool for gaining insight into the results of CFD simulations.</p> <p><strong>Calculate <code>flowRatePatch</code></strong></p> <p>To produce a 1D dataset and its visualization, we compute the flow rate over the “outlet” patch. For this purpose, we apply the <code>flowRatePatch(name=outlet)</code> function to the simulation data. The <code>flowRatePatch</code> function calculates the flow rate through a patch, which is a specified boundary in the flow domain. The input <code>name</code> specifies the patch to use, in this case, <code>outlet</code>. The <code>outlet</code> patch represents the boundary at the outlet of the flow domain, so the <code>flowRatePatch</code> function will calculate the flow rate through the outlet.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && \ postProcess -func "flowRatePatch(name=outlet)"'</span></span></code></pre></div> <p>This operation saves results into the <code>sonicFoam/postProcessing/flowRatePatch(name=outlet)/0/surfaceFieldValue.dat</code> file.</p> <p><strong>Generate VTK</strong></p> <p><code>foamToVTK</code> is a utility that converts simulation data stored in the OpenFOAM format to the VTK (<a href="https://vtk.org/about/#overview" target="_blank" rel="nofollow noopener noreferrer">Visualization ToolKit</a>) format. 
VTK is a popular file format for storing and visualizing scientific data, and it is often used for post-processing and visualization of CFD simulations.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd sonicFoam && foamToVTK'</span></span></code></pre></div> <p>This will convert the simulation data stored in the <code>sonicFoam</code> directory from the OpenFOAM format to the VTK format, allowing it to be visualized and analyzed using tools that support the VTK format. It creates <code>sonicFoam/VTK/</code> directory with formatted simulation results.</p> <h2 id="4-visualize-simulation-results" style="position:relative;">4. Visualize simulation results<a href="#4-visualize-simulation-results" aria-label="4 visualize simulation results permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To visualize the results of a simulation performed using the OpenFOAM toolkit's <code>sonicFoam</code> solver, you can use one of the post-processing tools included with the OpenFOAM toolkit, such as <code>paraFoam</code> or <code>foamToVTK</code>. These tools allow you to view and analyze the simulation results in a graphical interface.</p> <p>In the demo example, a 3D geometry mesh and float pressure diagram are generated. 
There are examples of generated files below.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 532px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/98d6462c2a187d6cb6c1006d3b1fc196/39600/3d_mesh_viz.png" alt="3D mesh visualization" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>3D mesh visualization</em></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/461f191c90cc88d5c5ff70947230537c/39600/float_pressure_diag.png" alt="Float pressure diagram" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Float pressure diagram</em></p> <h2 id="5-run-the-second-step-simulation-with-scalartransportfoam-solver" style="position:relative;">5. Run the second step simulation with <code>scalarTransportFoam</code> solver<a href="#5-run-the-second-step-simulation-with-scalartransportfoam-solver" aria-label="5 run the second step simulation with scalartransportfoam solver permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The <code>scalarTransportFoam</code> is a solver in the open-source CFD software OpenFOAM that is used to solve a transport equation for a passive scalar using a specified stationary velocity field. 
It is typically used to calculate the convection diffusion of a scalar in a given velocity field.</p> <p>Before running <code>scalarTransportFoam</code> solver, we need to update the stage configuration based on the <code>sonicFoam</code> outputs:</p> <ul> <li> <p>Copy <code>U</code> config from the last simulation stage in <code>sonicFoam</code></p> </li> <li> <p>Update <code>T</code> config with the <code>boundaryField</code> from the last simulation stage in <code>sonicFoam</code></p> </li> <li> <p>Copy the <code>polyMesh</code> to use the same geometry</p> </li> </ul> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token comment"># Configure scalarTransportFoam</span> <span class="token line"><span class="token input">$ </span><span class="token command">python3</span> src/config_scalarTransportFoam.py </span> <span class="token comment"># Run scalarTransportFoam simulation</span> <span class="token line">$ ./openfoam-cse-docker <span class="token parameter variable">-c</span> <span class="token string">'cd scalarTransportFoam && scalarTransportFoam'</span></span></code></pre></div> <p>The simulation will calculate the transport of the passive scalar using the specified velocity field and other input parameters. 
The resulting simulation data can then be post-processed and analyzed to gain insight into the transport of the scalar in the flow.</p> <h1 id="reduce-simulation-management-complexity-with-dvc" style="position:relative;">Reduce simulation management complexity with DVC<a href="#reduce-simulation-management-complexity-with-dvc" aria-label="reduce simulation management complexity with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>💡 For this part of the post, we follow the <code>main</code> branch in the <a href="https://gitlab.com/iterative.ai/cse_public/sonicfoam-demo/-/tree/main" target="_blank" rel="nofollow noopener noreferrer">demo repository</a>. Please follow the README to prepare your environment and install dependencies.</p> <p>Up to this point, we have run the different tasks of the simulation pipeline using separate commands. Let’s see how DVC tools can help with automating the simulation pipeline and handling simulation output data.</p> <p>DVC pipelines are a feature of the <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> (Data Version Control) tool. A DVC pipeline is a series of commands that are executed in a specific order and can be used to run all the steps that are needed: the simulation itself, post-processing of the results, and report generation. 
DVC automatically captures and tracks the data and code associated with your OpenFOAM simulations to make them reproducible and shareable with your team.</p> <h2 id="basic-computational-stage-configuration" style="position:relative;">Basic computational stage configuration<a href="#basic-computational-stage-configuration" aria-label="basic computational stage configuration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A DVC config file is written in YAML format and consists of a list of steps, each of which corresponds to a command that should be executed as part of the pipeline. The steps can depend on one another, meaning that the output from one step is used as input for another step. 
More details can be found on the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#stage-entries" target="_blank" rel="nofollow noopener noreferrer">DVC documentation website</a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ba275bdbf4ec6069aa0f53d531f8bdbc/39600/dag.png" alt="DVC DAG" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Let’s consider an example of the DVC pipeline configuration for <code>blockMesh</code> stage below.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">blockMesh</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> bash run.sh 'cd sonicFoam <span class="token important">&&</span> blockMesh' <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> sonicFoam/system/blockMeshDict <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> sonicFoam/constant/polyMesh</code></pre></div> <p>The <code>cmd</code> field specifies the command to be executed, which in this case is a utility shell script <code>run.sh</code> that changes the file permissions and runs the <code>blockMesh</code> command directly or using <code>openfoam-cse-docker</code> script. The <code>run.sh</code> script “knows” how to run the simulations pipeline on your local environment (manually) or as a part of the GitLab CI pipeline on the Cloud environment (automatically). 
We will discuss CI configuration in later sections.</p> <p>The <code>deps</code> field in this pipeline step specifies the input files that the <code>blockMesh</code> command depends on, here the <code>blockMeshDict</code> file. This file contains information about the mesh and the simulation parameters, and is required by the <code>blockMesh</code> command to generate the mesh.</p> <p>The <code>outs</code> field specifies the output files generated by the <code>blockMesh</code> command. In this case, the output is the <code>polyMesh</code> directory, which contains the generated mesh data. The mesh data is captured and versioned by DVC.</p> <h2 id="configure-simulation-pipelines-with-paramsyaml" style="position:relative;">Configure simulation pipelines with <code>params.yaml</code><a href="#configure-simulation-pipelines-with-paramsyaml" aria-label="configure simulation pipelines with paramsyaml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The DVC parameters file (<code>params.yaml</code>) configures an OpenFOAM simulation. 
Here is an extract of the parameters used for <code>sonicFoam</code> stage configuration:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">configureSim</span><span class="token punctuation">:</span> <span class="token key atrule">sim_config_dir</span><span class="token punctuation">:</span> configs <span class="token key atrule">controlDict</span><span class="token punctuation">:</span> <span class="token key atrule">path</span><span class="token punctuation">:</span> system/controlDict <span class="token key atrule">params</span><span class="token punctuation">:</span> <span class="token key atrule">startTime</span><span class="token punctuation">:</span> <span class="token number">0</span> <span class="token key atrule">endTime</span><span class="token punctuation">:</span> <span class="token number">3</span> <span class="token key atrule">deltaT</span><span class="token punctuation">:</span> <span class="token number">0.002</span> <span class="token key atrule">writeInterval</span><span class="token punctuation">:</span> <span class="token number">0.5</span> <span class="token key atrule">purgeWrite</span><span class="token punctuation">:</span> <span class="token number">0</span> <span class="token key atrule">writePrecision</span><span class="token punctuation">:</span> <span class="token number">5</span> <span class="token key atrule">timePrecision</span><span class="token punctuation">:</span> <span class="token number">6</span></code></pre></div> <p>The <code>params</code> field of the <code>controlDict</code> section specifies the values of the simulation control parameters. 
In this case, the <code>startTime</code>, <code>endTime</code>, <code>deltaT</code>, <code>writeInterval</code>, <code>purgeWrite</code>, <code>writePrecision</code>, and <code>timePrecision</code> parameters are set to specific values.</p> <p>In the DVC simulation setup, the user is responsible for putting the values from the <code>params.yaml</code> file into the <code>controlDict</code>. Unlike other tools that handle this process automatically, this approach requires some manual effort on the user's end but provides greater flexibility as it eliminates the need for support for each and every tool or software used in the simulation. The demo showcases how this task is carried out through the <code>src/configureSim.py</code> script.</p> <h2 id="adapt-dvc-behavior-for-the-simulation-use-case" style="position:relative;">Adapt DVC behavior for the simulation use case<a href="#adapt-dvc-behavior-for-the-simulation-use-case" aria-label="adapt dvc behavior for the simulation use case permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>DVC pipeline configuration expects that all inputs and outputs of each stage are explicitly defined in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. This is a common pattern in Machine Learning and Data Management pipelines. DVC uses explicit <code>deps</code> and <code>outs</code> to build a computational DAG and “understand” whether it needs to re-run a stage if some of its dependencies change. 
This ensures the reproducibility of the pipeline.</p> <p>However, OpenFOAM simulation pipelines are different. Depending on the simulation parameters (e.g. <code>endTime</code> and <code>writeInterval</code> in the <code>controlDict</code> parameters), a different number of files and folders can be generated. Therefore, it may be impossible to specify all outputs in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> in advance. But because these files are not specified in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, DVC can’t manage them properly. To solve this problem, we introduced two helper scripts that “help” DVC to find and handle generated files and folders for the simulation use case. Hopefully, <a href="https://github.com/iterative/dvc/issues/4816" target="_blank" rel="nofollow noopener noreferrer">supporting wildcard patterns</a> in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> configuration file will simplify such use cases!</p> <p>The two helper scripts are:</p> <ul> <li><code>dvc_outs_remove.py</code> - removes the stage outputs from the previous simulation. This script checks if there are files previously added by the <code>dvc_outs_handler.py</code> script and removes them from DVC with the <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a> command.</li> <li><code>dvc_outs_handler.py</code> - finds all “untracked” files and adds them to DVC control. By default, only files tracked by either Git or DVC are saved to the experiment. 
This script checks if there are files or directories generated by the stage and adds them to DVC with the <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> command.</li> </ul> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">sonicFoam</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token comment"># Remove previous sim results</span> <span class="token punctuation">-</span> python3 src/dvc_outs_remove.py <span class="token punctuation">-</span><span class="token punctuation">-</span>stage=sonicFoam <span class="token punctuation">...</span> <span class="token comment"># Run sim</span> <span class="token punctuation">-</span> bash run.sh 'cd sonicFoam <span class="token important">&&</span> sonicFoam' <span class="token comment"># Add generated files to DVC and create outputs index files</span> <span class="token punctuation">-</span> python3 src/dvc_outs_handler.py <span class="token punctuation">-</span><span class="token punctuation">-</span>stage=sonicFoam <span class="token punctuation">...</span> <span class="token key atrule">params</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> configureSim <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> sonicFoam/constant/polyMesh/ <span class="token punctuation">-</span> <span class="token punctuation">...</span> <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token punctuation">...</span></code></pre></div> <h2 id="link-stages-and-multiple-solvers" style="position:relative;">Link stages and multiple solvers<a href="#link-stages-and-multiple-solvers" aria-label="link stages and multiple solvers permalink" class="anchor 
after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>It is common for OpenFOAM simulations to involve complex pipelines with multiple steps and dependencies between the steps. This is because simulations often require the use of multiple solvers, each of which may have its own input and output files and dependencies on other solvers.</p> <p>For example, a simulation may require the use of multiple solvers to simulate different physical phenomena, such as fluid flow, heat transfer, and chemical reactions. These solvers may need to be run in a specific order and may depend on the output of other solvers as input.</p> <p>It’s possible to manage these dependencies with DVC! DVC allows you to specify the steps in the simulation pipeline and the dependencies between them in a configuration file.</p> <p>The demo project example has two solvers: <code>sonicFoam</code> and <code>scalarTransportFoam</code>. Both solvers depend on the same geometry generated by the <code>blockMesh</code> stage. In the case we know exactly the path to the output (<code>outs</code>) of the <code>sonicFoam</code> solver, we may explicitly define it as a dependency (<code>deps</code>) of the <code>scalarTransportFoam</code> stage. 
In our case, we use a utility script (<code>src/config_scalarTransportFoam.py</code>) to get the results of the <code>sonicFoam</code> solver and prepare the initial state for the <code>scalarTransportFoam</code> solver.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">scalarTransportFoam</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> python3 src/config_scalarTransportFoam.py <span class="token punctuation">-</span> <span class="token punctuation">...</span> <span class="token punctuation">-</span> bash run.sh 'cd scalarTransportFoam <span class="token important">&&</span> scalarTransportFoam' <span class="token punctuation">-</span> <span class="token punctuation">...</span> <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> sonicFoam/constant/polyMesh/ <span class="token punctuation">-</span> <span class="token punctuation">...</span> <span class="token key atrule">params</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> plotMesh <span class="token punctuation">-</span> scalarTransportFoam <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token punctuation">...</span></code></pre></div> <h2 id="run-a-new-simulation" style="position:relative;">Run a new simulation<a href="#run-a-new-simulation" aria-label="run a new simulation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 
0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>After the DVC pipeline is set up, you may run a new simulation experiment with a command:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div> <p>To run a new simulation with updated parameters you may manually change the parameter value in the <code>params.yaml</code> file and run <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> or, it’s possible to <a href="https://dvc.org/doc/command-reference/exp/run#example-modify-parameters-on-the-fly" target="_blank" rel="nofollow noopener noreferrer">modify parameters on-the-fly</a>. For example, let’s change the length of our simulation:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'configureSim.controlDict.params.endTime=4'</span></span></code></pre></div> <p>It is also possible to queue and run multiple simulations in parallel.</p> <p>In the next post, we will show how to visualize and compare simulation data with CML and Iterative Studio.</p> <h1 id="versioning-and-sharing-simulation-data-with-dvc" style="position:relative;">Versioning and sharing simulation data with DVC<a href="#versioning-and-sharing-simulation-data-with-dvc" aria-label="versioning and sharing simulation data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 
1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Effective data management is essential for successful OpenFOAM simulations. Proper data management can help you organize and track the data and code associated with your simulations, and make it easier to reproduce simulation results.</p> <p>There are several challenges that users of OpenFOAM may encounter in managing the data associated with their simulations:</p> <ol> <li> <p><strong>Large data volumes</strong>: OpenFOAM simulations can generate large amounts of data, particularly for complex or high-resolution simulations. This can make it difficult to store, transfer, and analyze the data effectively.</p> </li> <li> <p><strong>Data version control</strong>: It is important for users to be able to track changes to the input files and simulation results over time and to be able to reproduce past simulations. This can be challenging without a version control system or other means of tracking changes.</p> </li> <li> <p><strong>Data transfer</strong>: Users may need to transfer large amounts of data between different systems or devices, such as between their personal computers and a high-performance computing cluster. This can be challenging due to the size of the data and the potential for data transfer bottlenecks.</p> </li> <li> <p><strong>Collaboration</strong>: Users may want to share simulation results with colleagues or collaborate on simulations. This can be done by sharing the simulation input files and results, as well as using tools such as online collaborative platforms or version control systems.</p> </li> </ol> <p>Luckily, DVC may help with all of them. Let’s review the core features of DVC that we used in the demo project. 
<a href="https://dvc.org/doc/use-cases/versioning-data-and-models" target="_blank" rel="nofollow noopener noreferrer">Data versioning</a> is a core feature of DVC that helps to capture the versions of simulation data in Git commits, while storing the data itself on-premises or in cloud storage. Moreover, with DVC pipelines, all outputs specified as <code>outs</code>, <code>plots</code>, or <code>metrics</code> in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> configuration are automatically added to DVC version control! Other files, generated by different stages, are added to DVC via the <code>dvc_outs_handler.py</code> script. The next step is to set up DVC remote storage and upload these files there.</p> <p>DVC helps to store large volumes of data in on-premises or cloud storage (e.g. SSH, S3, HDFS, <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">etc.</a>). The demo project uses AWS S3 as remote storage. 
For more details on the remote storage configuration, see <a href="https://dvc.org/doc/command-reference/remote#example-customize-an-additional-s3-remote" target="_blank" rel="nofollow noopener noreferrer">Example: Customize an additional S3 remote</a>.</p> <p>You may use your own AWS S3 bucket by updating the URL of the <code>s3remote</code> remote with the following command:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> s3remote url s3://<span class="token operator"><</span>bucket<span class="token operator">></span>/<span class="token operator"><</span>path<span class="token operator">></span></span></code></pre></div> <p>After the remote storage is set up, you need a single additional command to transfer your results to the storage:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp push</span></span></code></pre></div> <p>For experiments, DVC takes care of pushing to and pulling from both Git and DVC remotes. This makes subsequent collaboration with colleagues simple. 
Your colleagues may access your last simulation results with a <a href="https://dvc.org/doc/command-reference/exp/pull"><code>dvc exp pull</code></a> command (after updating their repository with <code>git pull</code>):</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp pull</span></span></code></pre></div> <h1 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>This post details how Iterative tools help in physical and computational simulations. 
The demo shows how to set up DVC for simulation experiments and data management.</p> <p>Overall, DVC can help OpenFOAM users to:</p> <ol> <li> <p>Reduce the complexity of simulation pipelines and automate tasks such as running simulations, post-processing results, and generating reports.</p> </li> <li> <p>Manage and track the data and code associated with your OpenFOAM simulations, and make it easier to reproduce simulation results.</p> </li> <li> <p>Manage simulation experiments with YAML config files.</p> </li> <li> <p>Store and share simulation data in the cloud using DVC and AWS S3.</p> </li> <li> <p>Easily collaborate with your colleagues around simulation results, share and reuse data.</p> </li> </ol> <p>In the next post, we will discuss how to utilize cloud computing resources and visualize and compare simulation data with CML and Iterative Studio.</p>https://dvc.org/blog/automate-your-ml-pipeline-combining-airflow-dvc-and-cml-for-a-seamless-batch-scoring-experiencehttps://dvc.org/blog/automate-your-ml-pipeline-combining-airflow-dvc-and-cml-for-a-seamless-batch-scoring-experienceWed, 22 Mar 2023 00:00:00 GMT<p>Companies in Banking, Telecom, Retail, and other industries process enormous volumes of data to generate insights and gain value. <a href="https://www.datarobot.com/wiki/scoring/" target="_blank" rel="nofollow noopener noreferrer">Batch scoring</a> is a common way to operate machine learning applications for such companies. It helps to run ML training and inference (scoring) jobs that operate on large amounts of data. 
This post covers topics around the design, tools, and implementation of ML applications for batch scoring scenarios with Airflow.</p> <h3 id="what-is-batch-scoring" style="position:relative;">What is batch scoring?<a href="#what-is-batch-scoring" aria-label="what is batch scoring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In machine learning, scoring is the process of applying a trained model to a new dataset in an attempt to get practical predictions. Batch scoring is the way to score (get predictions) for large datasets that are collected over some period of time before being passed to the model. It is the most effective scoring pattern when the model’s decisions don’t have to be implemented immediately. For example, a CRM Department in Retail Banking may apply ML models to a batch of active customers to determine which are most likely to buy a new credit product next month. Other application examples:</p> <ul> <li> <p><strong>Marketing Communication Optimization:</strong> effectively identifying customers who are looking for new financial products and services, and then optimizing marketing communication, is a perfect application for AI. 
This use case includes not only identifying customers with a propensity to buy new products, but also customers at risk of churning.</p> </li> <li> <p><strong>Pricing Optimization:</strong> personalization of banking services requires monitoring the marketplace dynamically to provide competitive prices for existing and new customers.</p> </li> <li> <p><strong>Next Best Action (NBA):</strong> this is a promising customer-centric approach to optimize multiple different actions that could be taken for a specific customer through multiple communication channels.</p> </li> </ul> <h3 id="goals-for-this-post" style="position:relative;">Goals for this post<a href="#goals-for-this-post" aria-label="goals for this post permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This post shares an approach to solve 3 tasks in batch scoring applications:</p> <ul> <li> <p>Build an ML pipeline to train a model.</p> </li> <li> <p>Setup a <code>train</code> CI job to run a model training at scale.</p> </li> <li> <p>Setup a <code>deploy</code> CI job to deliver the inference (scoring) pipeline to an Airflow cluster.</p> </li> </ul> <h3 id="how-to-reproduce" style="position:relative;">How to reproduce<a href="#how-to-reproduce" aria-label="how to reproduce permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 
0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Code examples are stored in two repositories:</p> <ul> <li> <p><a href="https://gitlab.com/iterative.ai/cse_public/home_credit_default" target="_blank" rel="nofollow noopener noreferrer">home_credit_default</a> contains an end-to-end solution for a batch scoring application with Airflow.</p> </li> <li> <p><a href="https://gitlab.com/iterative.ai/cse_public/airflow-cluster" target="_blank" rel="nofollow noopener noreferrer">airflow-cluster</a> contains configuration for Airflow and other services.</p> </li> </ul> <p>Fork the <a href="https://gitlab.com/iterative.ai/cse_public/home_credit_default" target="_blank" rel="nofollow noopener noreferrer">home_credit_default</a> repository if you'd like to replicate our steps and deploy your own batch-scoring application with Airflow and DVC. Keep in mind that you'll need to set up and configure the following:</p> <ul> <li> <p>GitLab account and <a href="https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html" target="_blank" rel="nofollow noopener noreferrer">Personal Access Token</a>.</p> </li> <li> <p><a href="https://pipenv.pypa.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer"><code>pip</code></a> and Docker installed locally.</p> </li> </ul> <p>The repository also contains code for Airflow DAGs, which can be found in the <code>dags/</code> directory. 
A separate <a href="https://gitlab.com/iterative.ai/cse_public/airflow-cluster" target="_blank" rel="nofollow noopener noreferrer">airflow-cluster</a> repository is used to set up and run the Airflow cluster.</p> <h2 id="design-ml-pipelines-with-dvc" style="position:relative;">Design ML pipelines with DVC<a href="#design-ml-pipelines-with-dvc" aria-label="design ml pipelines with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Machine Learning experiment pipelines for batch-scoring applications typically involve the following steps:</p> <ol> <li> <p><strong>Data preparation:</strong> The first step is to clean, pre-process, and transform the data into a format that can be used for training machine learning models.</p> </li> <li> <p><strong>Feature engineering:</strong> In this step, relevant features are extracted or created from the data and transformed into a format that can be used for training machine learning models.</p> </li> <li> <p><strong>Model selection and training:</strong> Next, multiple machine learning models are selected and trained using the prepared data.</p> </li> <li> <p><strong>Model evaluation:</strong> The trained models are then evaluated to determine their accuracy and performance on new data.</p> </li> </ol> <p>By following these steps, the pipeline provides a systematic approach to experimenting with different machine learning models, including feature engineering, and selecting the best one for deployment.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; 
margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/082baaa38ed94173d19d9a198cbbbda1/39600/diagram.png" alt="DVC Pipeline Design" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Machine Learning experiment pipelines for batch scoring applications</em></p> <p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> is a great tool that can help to automate such kinds of ML pipelines. For the purpose of this tutorial, the DVC pipeline consists of five stages (see <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> in <a href="https://gitlab.com/iterative.ai/cse_public/home_credit_default" target="_blank" rel="nofollow noopener noreferrer">the example repo</a>):</p> <ul> <li> <p>Load Data (<code>load_data</code>)</p> </li> <li> <p>Calculate features for <code>bureau.csv</code> data (<code>extract_features_bureau</code>)</p> </li> <li> <p>Calculate features for <code>application.csv</code> data (<code>extract_features_application</code>)</p> </li> <li> <p>Join features (<code>join_features</code>)</p> </li> <li> <p>Train and save a model (<code>train</code>)</p> </li> </ul> <p>The diagram below visualizes dependencies between stages of the DVC pipeline. For such patterns, DVC helps automatically track changes and optimize the time to run the pipeline. For example, if you iteratively improve only code to calculate features for Bureau data, DVC will only rerun 3 stages: <code>extract_features_bureau</code>, <code>join_features</code>, and <code>train</code>. 
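</p> <p>The selective rerun described above can be sketched as a walk over the stage graph: a changed stage invalidates itself and everything downstream of it. The snippet below is a minimal illustration, not DVC's actual implementation. The edges in <code>STAGE_DEPS</code> are assumed from the stage list above rather than read from <code>dvc.yaml</code>, and <code>stages_to_rerun</code> is a hypothetical helper name.</p>

```python
from collections import deque

# Assumed stage graph: each stage maps to the stages it depends on.
# The edges are inferred from the stage list in this post, not read from dvc.yaml.
STAGE_DEPS = {
    "load_data": [],
    "extract_features_bureau": ["load_data"],
    "extract_features_application": ["load_data"],
    "join_features": ["extract_features_bureau", "extract_features_application"],
    "train": ["join_features"],
}


def stages_to_rerun(changed, stage_deps=STAGE_DEPS):
    """Return the changed stages plus every stage downstream of them."""
    # Invert the graph: stage -> stages that consume its outputs.
    consumers = {name: [] for name in stage_deps}
    for stage, deps in stage_deps.items():
        for dep in deps:
            consumers[dep].append(stage)

    # Breadth-first walk from the changed stages through their consumers.
    stale = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for consumer in consumers[current]:
            if consumer not in stale:
                stale.add(consumer)
                queue.append(consumer)

    # Report in declaration order so the result is stable and readable.
    return [name for name in stage_deps if name in stale]


if __name__ == "__main__":
    print(stages_to_rerun(["extract_features_bureau"]))
```

<p>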
DVC will skip running <code>load_data</code> and <code>extract_features_application</code> because these steps did not change, saving a substantial amount of time.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2e7610eb51bc6541433ff3bb99bc7c87/39600/dependency-diagram.png" alt="Dependency diagram" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Dependency Diagram</em></p> <p>After we prepare the configuration for the ML pipeline, DVC helps to run a new model training experiment with a single command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div> <p>Or, if you want to override values from the <code>params.yaml</code> file and give the experiment a specific name, you may run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-n</span> <span class="token operator"><</span>NAME<span class="token operator">></span> <span class="token punctuation">[</span>--set-param <span class="token operator"><</span>param_name<span class="token operator">>=</span><span class="token operator"><</span>param_value<span class="token operator">></span><span class="token punctuation">]</span></span></code></pre></div> <h2 id="train-model-at-scale-with-studio-and-cml" style="position:relative;">Train model at scale with Studio and CML<a href="#train-model-at-scale-with-studio-and-cml" aria-label="train model at scale with studio and cml permalink" class="anchor after"><svg 
aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In a common scenario, batch-scoring applications require a large amount of data stored in remote storage. Data Scientists run ML experiments on a local (dev) machine (e.g. laptop) using a sample of the data. After the model and hyperparameters configuration are found, an additional training run on the full dataset is required. Sometimes, the final model training is run on a different high-performance machine. Results for the ML experiments should be stored and accessible for the next analysis, following experiments, and any team members that need to review them.</p> <h3 id="continuous-integration-ci-workflow" style="position:relative;">Continuous Integration (CI) workflow<a href="#continuous-integration-ci-workflow" aria-label="continuous integration ci workflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Designing a CI (Continuous Integration) job to run model training at scale involves the following steps:</p> <ol> <li> <p><strong>Environment setup:</strong> Create a reproducible environment for model training by using virtual machines or containers. 
GitLab and <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> help us prepare and provision an environment for the training job.</p> </li> <li> <p><strong>Automated build:</strong> Set up an automated build process that triggers a build every time code is committed to the repository. We use GitLab CI configuration to automate building a Docker image and running tests for the code.</p> </li> <li> <p><strong>Parallel processing:</strong> Utilize parallel processing to run multiple model training jobs in parallel. This reduces the time required to train the model and can be accomplished using tools like Dask or Ray. In this example, we don’t use these tools.</p> </li> <li> <p><strong>Training:</strong> Make sure that the model training pipeline can scale to handle large amounts of data and processing power. As a result of the training job, a new model is saved. CML may help to set up and use cloud computing resources or high-performance computing systems.</p> </li> </ol> <p>The GitLab Continuous Integration (CI) pipeline configuration for this post's example is stored in the <a href="https://gitlab.com/iterative.ai/cse_public/home_credit_default/-/blob/main/.gitlab-ci.yml" target="_blank" rel="nofollow noopener noreferrer"><code>.gitlab-ci.yml</code> file</a>. It specifies the different stages of the pipeline, including building an image, testing the code, training a model, and deploying Airflow DAGs. 
The image below provides a graphical representation of this pipeline.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/148f86f382ad2e8f6003b216ba489a7d/39600/gitlab-airflow.png" alt="GitLab Continuous Integration Pipeline Configuration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>GitLab Continuous Integration Pipeline Configuration with Airflow Cluster</em></p> <ul> <li> <p>The GitLab repository triggers the CI pipeline as soon as new code or parameters updates are committed to the repository. This runs <code>build</code>, <code>test</code>, and <code>train</code> CI jobs. The <code>train</code> job runs a model training on a full dataset on a remote machine (or cloud), generates model training reports, and creates a PR in the GitLab repo.</p> </li> <li> <p>Merging (accepting a pull/merge request) the experiment results into the <code>main</code> branch triggers the <code>deploy</code> job.</p> </li> <li> <p>Every month, Airflow runs <code>scoring</code> jobs to generate predictions (scores) for all clients on new data. 
Generated predictions are stored in the prediction database or files.</p> </li> </ul> <h3 id="setup-train-job-with-gitlab-ci-and-cml" style="position:relative;">Setup <code>train</code> job with GitLab CI and CML<a href="#setup-train-job-with-gitlab-ci-and-cml" aria-label="setup train job with gitlab ci and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>For this post’s example the training job is triggered on creating a new Merge Request into the <code>main</code> branch or, if the Git commit message (commit to any branch) contains the <code>[exp]</code> tag. This configuration allows us to achieve two goals:</p> <ol> <li> <p>We may define whether new code (or params) changes need to trigger a new experiment, or if it’s just a minor update (e.g. update the documentation in README) there is no need to run a new experiment,</p> </li> <li> <p>We ensure that every merge into the <code>main</code> branch is linked to the latest model.</p> </li> </ol> <p>An example of the <code>train</code> job configuration is presented below. 
There are three main steps in the <code>script</code> there:</p> <ol> <li> <p>Run a new experiment on a full-scale dataset with <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a></p> </li> <li> <p>Prepare the <code>report.md</code> file with metrics and plots,</p> </li> <li> <p>Publish the <code>report.md</code> content to the Merge Request (Pull Request) message in GitLab (<a href="https://cml.dev/doc/ref/publish" target="_blank" rel="nofollow noopener noreferrer">using CML</a>).</p> </li> </ol> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token punctuation">...</span> <span class="token key atrule">rules</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">if</span><span class="token punctuation">:</span> $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "main" <span class="token punctuation">|</span><span class="token punctuation">|</span> $CI_COMMIT_MESSAGE =~ /\<span class="token punctuation">[</span>exp\<span class="token punctuation">]</span>/ <span class="token key atrule">image</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span>PROJECT_IMAGE<span class="token punctuation">}</span> <span class="token key atrule">script</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token punctuation">...</span> <span class="token punctuation">-</span> dvc exp run <span class="token punctuation">-</span><span class="token punctuation">-</span>pull <span class="token punctuation">-</span>S load_data.sample_size=1.0 <span class="token punctuation">-</span> <span class="token punctuation">|</span><span class="token scalar string"> echo "# Metrics" >> report.md echo "## Experiment metrics" >> report.md dvc metrics show --show-md >> report.md ... 
echo "## Plot train lift curve " >> report.md echo '![](./reports/lift_curve_train.png "Train lift curve")' >> report.md</span> <span class="token punctuation">-</span> cml pr create . <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md <span class="token punctuation">-</span> cml comment create <span class="token punctuation">-</span><span class="token punctuation">-</span>target=commit report.md</code></pre></div> <h3 id="run-ml-experiments-with-iterative-studio" style="position:relative;">Run ML experiments with Iterative Studio<a href="#run-ml-experiments-with-iterative-studio" aria-label="run ml experiments with iterative studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The proposed CI pipeline makes it possible to implement a development process that:</p> <ul> <li> <p>Automates the launch of experiments with training models when the code changes.</p> </li> <li> <p>Links the change in versions of the code and artifacts (models, data).</p> </li> <li> <p>Makes the development more straightforward and manageable.</p> </li> </ul> <p>Moreover, it enables <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> to run new experiments from the UI.</p> <p>The experimenting process of Iterative Studio is very simple! (See diagram below).</p> <ul> <li> <p>In the first step (1), we update the experiment configuration and trigger running a new experiment. 
This functionality is available in the standard package of Iterative Studio.</p> </li> <li> <p>Then (2) the configured GitLab CI pipeline launches the experiment job.</p> </li> <li> <p>After the job completes, CML publishes the experiment report as a comment on the GitLab commit (3). Iterative Studio continuously monitors the project repository for updates.</p> </li> <li> <p>As soon as the repo changes, Iterative Studio updates the tracked files in the UI (4). Data Scientists can compare experiment metrics and plots.</p> </li> <li> <p>Also, DVC stores the updated versions of the model and artifacts in DVC Storage (5).</p> </li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d823204577287d5aaa66392eac57679f/39600/trigger-experiment.png" alt="GitLab Continuous Integration Pipeline Configuration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>GitLab Continuous Integration Pipeline Configuration with Airflow Cluster</em></p> <p>After the experiment completes, Iterative Studio helps visualize parameters, metrics, and plots. 
Users may compare experiments, run new ones, and share with colleagues.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1ec8814c2de0e8238ab27cc1d62ad0da/39600/confusion-matrix.png" alt="Confusion Matrices" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Visualize Parameters, Metrics and Plots in Iterative Studio</em></p> <h2 id="deploy-scoring-pipeline" style="position:relative;">Deploy scoring pipeline<a href="#deploy-scoring-pipeline" aria-label="deploy scoring pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A batch scoring inference pipeline in machine learning is a series of steps that are executed in a specific order to process a large amount of data and generate predictions based on a pre-trained model. It typically includes the following steps:</p> <ol> <li> <p><strong>Input data preparation:</strong> This step involves cleaning, transforming, and preprocessing the input data so that it can be fed into the model for prediction. Feature engineering can be a part of this step.</p> </li> <li> <p><strong>Model loading:</strong> The pre-trained model is loaded into memory, usually from storage or a database, so that it can be used for predictions.</p> </li> <li> <p><strong>Inference:</strong> The input data is passed through the model to generate predictions. 
This is done in a batch-wise manner, where a large amount of data is processed in one go to reduce the overhead of repeatedly loading the model.</p> </li> <li> <p><strong>Post-processing:</strong> This step involves any additional processing of the prediction results, such as normalization, thresholding, or aggregation, before they are written to an output file or database.</p> </li> <li> <p><strong>Saving predictions:</strong> Finally, the prediction results are saved to a file or database for further analysis or use. This can be done in various formats, such as CSV, JSON, or binary.</p> </li> </ol> <p>The pipeline can be implemented using a variety of tools and technologies such as Apache Airflow, Apache Spark, or even custom scripts. The key aspect of a scoring pipeline is that it is automated, efficient, and scalable, making it possible to score large volumes of data in a timely and consistent manner.</p> <p>Because of the large number of pre- and post-processing tasks, including checks for data source updates, a typical scenario requires deploying a scoring pipeline rather than just a model.</p> <h3 id="batch-scoring-inference-pipeline-with-airflow" style="position:relative;">Batch scoring inference pipeline with Airflow<a href="#batch-scoring-inference-pipeline-with-airflow" aria-label="batch scoring inference pipeline with airflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In this example, we implement an inference pipeline using <a href="https://airflow.apache.org/" target="_blank" rel="nofollow noopener noreferrer">Apache Airflow</a>. 
Airflow helps to schedule and run pipelines (DAG) for various data engineering and machine learning purposes. DAG is a Directed Acyclic Graph that describes an order of computational Tasks (jobs) to run. The basics of the Airflow pipeline definition can be found <a href="https://airflow.apache.org/docs/apache-airflow/stable/start.html" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <p>We store Airflow DAGs in the <code>dags/</code> directory in the same repository as our ML pipeline.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d3ac81fde4d1b093c3cb70d812dae65a/39600/dags.png" alt="DAG" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DAGs Directory</em></p> <p>Let’s go a bit deeper into the Airflow DAG <code>dags/scoring.py</code> to find out how DVC is used there! This DAG is designed to be run every 5th day of the month to calculate predictions and save them into a .csv file.</p> <p>The DAG performs the following steps:</p> <ol> <li> <p>It creates a temporary directory for the local repository (<strong><code>create_tmp_dir</code></strong> task).</p> </li> <li> <p>It clones the repository specified in the <strong><code>project_args</code></strong> argument (<strong><code>clone</code></strong> task).</p> </li> <li> <p>It runs the scoring script from the cloned repository and saves predictions (<strong><code>run_scoring</code></strong> task).</p> </li> <li> <p>Finally, it removes the temporary repository directory (<strong><code>clean</code></strong> task).</p> </li> </ol> <p>For the purposes of this post, we are most interested in the <code>run_scoring</code> task! The task 'run_scoring' is a BashOperator in Apache Airflow. 
It performs the following actions:</p> <ol> <li> <p>Runs the <a href="https://dvc.org/doc/command-reference/fetch"><code>dvc fetch</code></a> command to fetch the latest version of the artifacts and model to be used for inference.</p> </li> <li> <p>Runs the <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> command to check out the latest version of the data.</p> </li> <li> <p>Runs a Python script located at <code>src/stages/scoring.py</code> with the following command line arguments:</p> <ul> <li> <p><code>--config</code> specifies the path to the parameters file in YAML format,</p> </li> <li> <p><code>--scoring-date</code> specifies the date for which the scoring should be performed,</p> </li> <li> <p><code>--storage-path</code> specifies the location of the storage.</p> </li> </ul> </li> </ol> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">run_scoring <span class="token operator">=</span> BashOperator<span class="token punctuation">(</span>
    task_id<span class="token operator">=</span><span class="token string">'run_scoring'</span><span class="token punctuation">,</span>
    bash_command<span class="token operator">=</span><span class="token string-interpolation"><span class="token string">f'''</span></span>
        cd <span class="token punctuation">{</span>project_args<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'dag_run_dir'</span><span class="token punctuation">)</span><span class="token punctuation">}</span> <span class="token operator">&</span><span class="token operator">&</span> \
        export PYTHONPATH<span class="token operator">=</span><span class="token punctuation">.</span> <span class="token operator">&</span><span class="token operator">&</span> \
        dvc fetch <span class="token operator">&</span><span class="token operator">&</span> \
        dvc checkout <span class="token operator">&</span><span class="token operator">&</span> \
        python src<span class="token operator">/</span>stages<span class="token operator">/</span>scoring<span class="token punctuation">.</span>py \
            <span class="token operator">-</span><span class="token operator">-</span>config<span class="token operator">=</span>params<span class="token punctuation">.</span>yaml \
            <span class="token operator">-</span><span class="token operator">-</span>scoring<span class="token operator">-</span>date<span class="token operator">=</span><span class="token punctuation">{</span><span class="token punctuation">{</span><span class="token punctuation">{</span><span class="token punctuation">{</span> first_day_of_month<span class="token punctuation">(</span>ds<span class="token punctuation">)</span> <span class="token punctuation">}</span><span class="token punctuation">}</span><span class="token punctuation">}</span><span class="token punctuation">}</span> \
            <span class="token operator">-</span><span class="token operator">-</span>storage<span class="token operator">-</span>path<span class="token operator">=</span><span class="token punctuation">{</span>project_args<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'storage_path'</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
    <span class="token string">'''</span><span class="token punctuation">,</span>
<span class="token punctuation">)</span></code></pre></div> <p>This example shows how the Airflow DAG is deployed and how DVC fetches the latest model to be used for inference. 
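</p> <p>One detail of the snippet above is worth spelling out. Because <code>bash_command</code> is built with a Python f-string, the quadruple braces around <code>first_day_of_month(ds)</code> collapse to double braces, which Airflow then renders as a Jinja template when the task runs. The minimal sketch below reproduces only that string formatting; the <code>project_args</code> value is a hypothetical placeholder, and no Airflow installation is needed to run it:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Hypothetical stand-in for the DAG's project_args dictionary.
project_args = {"storage_path": "/tmp/storage"}

# In an f-string, doubled braces are literal braces, so four braces
# in the source leave two in the result: the Jinja template syntax
# that Airflow renders when the task runs.
command = (
    "python src/stages/scoring.py "
    "--config=params.yaml "
    f"--scoring-date={{{{ first_day_of_month(ds) }}}} "
    f"--storage-path={project_args.get('storage_path')}"
)

print(command)</code></pre></div> <p>Printing <code>command</code> shows <code>--scoring-date={{ first_day_of_month(ds) }}</code>. Here <code>ds</code> is Airflow's built-in template variable holding the run's logical date, while <code>first_day_of_month</code> is presumably a custom macro registered on the DAG (an assumption, since the post does not show its definition).</p> <p>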
This is awesome!</p> <h3 id="setup-ci-job-deploy" style="position:relative;">Setup CI job <code>deploy</code><a href="#setup-ci-job-deploy" aria-label="setup ci job deploy permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <aside> 💡 Merging (accepting PR) experiment results into the `main` branch triggers the `deploy` job. </aside> <p>There are various strategies for delivering <code>scoring</code> DAG to the Airflow cluster. In this example, the GitLab CI pipeline pushes (copies) DAG files from the repo to the Airflow home directory (specified by <code>${AIRFLOW_HOME}</code>) and activates it.</p> <p>The <code>deploy_dags</code> CI job configuration looks like this:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">deploy_dags</span><span class="token punctuation">:</span> <span class="token key atrule">stage</span><span class="token punctuation">:</span> deploy <span class="token punctuation">...</span> <span class="token key atrule">script</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token punctuation">|</span><span class="token scalar string"> export DAGS_FOLDER=${AIRFLOW_HOME}/dags/${PROJECT_FOLDER}</span> <span class="token comment"># Create ${DAGS_FOLDER}</span> rm <span class="token punctuation">-</span>rf $<span class="token punctuation">{</span>DAGS_FOLDER<span class="token punctuation">}</span> <span class="token important">&&</span> mkdir <span class="token punctuation">-</span>p $<span class="token 
punctuation">{</span>DAGS_FOLDER<span class="token punctuation">}</span> <span class="token comment"># Copy content of folder ./dags to ${DAGS_FOLDER} directory</span> cp <span class="token punctuation">-</span>r dags/* $<span class="token punctuation">{</span>DAGS_FOLDER<span class="token punctuation">}</span> echo "Airflow DAGs copied to $<span class="token punctuation">{</span>DAGS_FOLDER<span class="token punctuation">}</span>"</code></pre></div> <p>This simple example is for demonstration purposes, but it works as a proof-of-concept for DVC-Airflow-Studio integration for batch scoring applications.</p> <h2 id="results" style="position:relative;">Results<a href="#results" aria-label="results permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The proposed approach demonstrates how DVC, CML, and Iterative Studio may help in batch scoring applications at the experimentation and production phases. Solutions discussed in this post may benefit similar use cases in a few ways:</p> <ul> <li> <p>Help with system design and tools integration.</p> </li> <li> <p>Automate ML experiments.</p> </li> <li> <p>Increasing speed of Proof-Of-Concept (POC) and Operationalization (MLOps) stages.</p> </li> <li> <p>Saving time and money for similar projects.</p> </li> </ul> <p>Specifically, DVC and Iterative Studio can benefit batch scoring Applications by:</p> <ul> <li> <p>Enabling regulatory compliance and auditability. Iterative Studio offers a robust approach for data usage tracking, keeping, and versioning data and configurations used for model training and prediction. 
Models are developed in a robust environment allowing us to link code, data, and configs for reproducible experiments and ensure auditability in the event of a compliance audit.</p> </li> <li> <p>Run machine learning experiments, with or without coding. Iterative Studio offers a user-friendly UI for analysts and data scientists to create a new experiment, change the configuration, and run with a one-button-click.</p> </li> <li> <p>Access versioned models during the CI/CD process and use them to run a scoring job with Airflow.</p> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> </li> </ul> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/cloud-versioninghttps://dvc.org/blog/cloud-versioningWed, 22 Feb 2023 00:00:00 GMT<p>If you use cloud storage regularly, you have probably seen it become a mess like this S3 bucket:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8f24ac8a282698234abe9b2368846be2/39600/no_versions.png" alt="no versions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Luckily, major cloud storage providers can version files automatically. 
Still, even with versioning enabled, you might find you end up with a mess. More importantly, you forget which version is which.</p> <p>That's because versioning happens at the file level. There's no way to version a composite dataset or entire machine learning project. This is where DVC can supplement cloud versioning and finally let you clean up your cloud storage. DVC records the versions of all the files in your dataset, so you have a complete snapshot of each point in time. You can store this record in Git alongside the rest of your project and use it to recover the data from that time, giving you the freedom to keep adding new data in place without fear of losing track of the old data. DVC ensures reproducibility while keeping everything organized between your Git repo and cloud storage, so you can focus on iterating on your machine learning project.</p> <admon type="info"> <p>If you already use DVC, you might be familiar with data versioning and want to know what DVC cloud versioning means for you. Read the next section to get more familiar with cloud versioning generally or skip directly to the section <a href="#for-existing-dvc-users">for existing DVC users</a>.</p> </admon> <h1 id="how-cloud-versioning-works" style="position:relative;">How cloud versioning works<a href="#how-cloud-versioning-works" aria-label="how cloud versioning works permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>With versioning enabled, whenever you save a file to the cloud, it will get a unique version ID. 
When you overwrite (or even delete) a file, the previous version remains accessible by referencing its version ID.</p> <p>Here's the same data from above organized with cloud versioning:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/90dc198bc1858e98b9b1aabc314958be/39600/show_versions.png" alt="show versions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Overwritten and deleted files may be recovered using their version IDs.</em></p> <p>And here it is showing only the current versions:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7f12527c5ca6a17e2200dc6756d58678/39600/collapsed_versions.png" alt="collapsed versions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Enabling versioning can keep your cloud storage organized by collapsing file versions.</em></p> <p>Now the model versions are all collapsed under one file name and ordered by time, but what about the <code>predictions</code> folder? Let's assume this project trains a neural machine translation model, and each file in <code>predictions</code> is a predicted translation of a sentence. Each model iteration generates a new set of predictions. 
How can we reassemble the predictions from an earlier model version?</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/427c01d0e46fa7a15d31f51f544d406a/39600/dir_versions.png" alt="dir versions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>For a folder of many files, keeping track of versions becomes unrealistic.</em></p> <h1 id="how-dvc-works-with-cloud-versioning" style="position:relative;">How DVC works with cloud versioning<a href="#how-dvc-works-with-cloud-versioning" aria-label="how dvc works with cloud versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Cloud versioning falls short for tracking and syncing folders and projects, but this is where DVC can help. DVC records the version IDs of all files in your dataset or project. 
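</p> <p>To make the idea concrete, here is a minimal Python sketch. It is not DVC's implementation: <code>VersionedBucket</code>, <code>push</code>, and the file contents are hypothetical, and the bucket is an in-memory dictionary rather than real cloud storage. It only illustrates why recording one version ID per file is enough to reassemble an earlier state of a whole folder:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import uuid


class VersionedBucket:
    """Toy in-memory stand-in for a bucket with versioning enabled."""

    def __init__(self):
        self._objects = {}  # maps (key, version_id) to content

    def put(self, key, content):
        """Store content under key and return a new unique version ID."""
        version_id = uuid.uuid4().hex
        self._objects[(key, version_id)] = content
        return version_id

    def get(self, key, version_id):
        """Return the content of one specific version of key."""
        return self._objects[(key, version_id)]


def push(bucket, files):
    """Upload every file and record its version ID as a snapshot."""
    return {key: bucket.put(key, content) for key, content in files.items()}


bucket = VersionedBucket()

# Predictions from the first model iteration.
snapshot_v1 = push(bucket, {"predictions/0.txt": "hallo", "predictions/1.txt": "welt"})
# The second iteration overwrites the same keys with new content.
snapshot_v2 = push(bucket, {"predictions/0.txt": "hello", "predictions/1.txt": "world"})

# The recorded version IDs are enough to reassemble the first folder state.
restored = {key: bucket.get(key, vid) for key, vid in snapshot_v1.items()}
print(restored)</code></pre></div> <p>Each snapshot is a small mapping from file path to version ID, which is why it can live in version control while the data itself stays in the bucket.</p> <p>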
You keep this record in a Git repository so you can maintain snapshots of your cloud-versioned data (the data itself gets stored on the cloud, not in Git).</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9de9366bfc669d5157b3bd8e6ba4d152/39600/dir_versions_dvc.png" alt="dir versions dvc" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC connects multiple version IDs across a folder or project.</em></p> <admon type="tip"> <p>Before you start with DVC, ensure that your cloud storage is configured correctly. Cloud versioning must be enabled at the bucket or storage account level. See <a href="#quickstart">Quickstart</a> for instructions below if versioning is not already enabled. You also need write access to the cloud storage (more info on how to configure your storage <a href="https://dvc.org/doc/user-guide/data-management/remote-storage" target="_blank" rel="nofollow noopener noreferrer">here</a>).</p> </admon> <p>To start using cloud versioning in DVC, <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">install</a> DVC and set up a <code>version_aware</code> remote inside a Git repo. 
A remote is the cloud storage location where you want to sync the data, and <code>version_aware</code> tells DVC to use cloud versioning.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span> </span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">--default</span> myremote s3://cloud-versioned-bucket/path </span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote version_aware <span class="token boolean">true</span></span></code></pre></div> <p>Use <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> to start tracking your model and predictions and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> to sync them to the cloud.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> model.pt predictions </span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> </span>11 files pushed</code></pre></div> <admon type="tip"> <p>If you want to start tracking changes to an existing cloud dataset instead of starting from a local copy, see <a href="https://dvc.org/doc/command-reference/import-url#example-tracking-cloud-version-ids" target="_blank" rel="nofollow noopener noreferrer">dvc import-url --version-aware</a>.</p> </admon> <p>DVC adds <code>model.pt.dvc</code> and <code>predictions.dvc</code> files with the version ID (and other metadata) of each file.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">outs</span><span 
class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">path</span><span class="token punctuation">:</span> predictions <span class="token key atrule">files</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">relpath</span><span class="token punctuation">:</span> 0.txt <span class="token key atrule">md5</span><span class="token punctuation">:</span> f163358b0b2b89281d6990e82495d6ca <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">154</span> <span class="token key atrule">cloud</span><span class="token punctuation">:</span> <span class="token key atrule">myremote</span><span class="token punctuation">:</span> <span class="token key atrule">etag</span><span class="token punctuation">:</span> f163358b0b2b89281d6990e82495d6ca <span class="token key atrule">version_id</span><span class="token punctuation">:</span> UkLM3za5T8oH6.EeZCqOrFNBvUnrAlT7 <span class="token punctuation">-</span> <span class="token key atrule">relpath</span><span class="token punctuation">:</span> 1.txt <span class="token key atrule">md5</span><span class="token punctuation">:</span> ec736fcb3b92886399f3577eac2163bb <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">154</span> <span class="token key atrule">cloud</span><span class="token punctuation">:</span> <span class="token key atrule">myremote</span><span class="token punctuation">:</span> <span class="token key atrule">etag</span><span class="token punctuation">:</span> ec736fcb3b92886399f3577eac2163bb <span class="token key atrule">version_id</span><span class="token punctuation">:</span> fE4Fst2Z25sYEjaJo_0mXZzWDT6vQ4Uz</code></pre></div> <p>Next, track <code>model.pt.dvc</code> and <code>predictions.dvc</code> in Git.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code 
class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> model.pt.dvc predictions.dvc .gitignore </span> <span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">"added and pushed model and predictions"</span></span></code></pre></div> <admon type="tip"> <p>DVC will also make Git ignore <code>model.pt</code> and the <code>predictions</code> folder so that Git only tracks the metadata. For more info on the mechanics of how DVC works, see <a href="https://dvc.org/doc/use-cases/versioning-data-and-models" target="_blank" rel="nofollow noopener noreferrer">Versioning Data and Models</a>.</p> </admon> <p>Now there is a versioned record of the model and predictions in Git commits, and we can revert to any of them without having to manually track version IDs. If someone else clones the Git repo, they can pull the exact versions pushed with that commit, even if those have been overwritten in cloud storage.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> git@github.com:iterative/myrepo </span> <span class="token line"><span class="token input">$ </span><span class="token command">cd</span> myrepo </span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> </span>A predictions/ A model.pt 2 files added and 11 files fetched</code></pre></div> <h1 id="for-existing-dvc-users" style="position:relative;">For existing DVC users<a href="#for-existing-dvc-users" aria-label="for existing dvc users permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>If you have versioning enabled on your cloud storage (or can enable it), you may wish to start using <code>version_aware</code> remotes to simplify the structure of your remote (or so you don't have to explain that structure to your colleagues). A <code>version_aware</code> remote is similar to the remotes you already use, except easier to read.</p> <p>A traditional cache-like DVC remote looks like:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5b81c5336415ea6c37f1e551e23eb3ea/39600/remote_cache.png" alt="remote cache" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>A cloud-versioned remote looks like:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/16c473880803d465f38e753ef4d97a06/39600/remote_cloud_versioned.png" alt="remote cloud versioned" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>The other difference is that version IDs get added to the <a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">DVC metafiles</a> during <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token 
punctuation">-</span> <span class="token key atrule">path</span><span class="token punctuation">:</span> predictions <span class="token key atrule">files</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">relpath</span><span class="token punctuation">:</span> 0.txt <span class="token key atrule">md5</span><span class="token punctuation">:</span> f163358b0b2b89281d6990e82495d6ca <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">154</span> <span class="token key atrule">cloud</span><span class="token punctuation">:</span> <span class="token key atrule">myremote</span><span class="token punctuation">:</span> <span class="token key atrule">etag</span><span class="token punctuation">:</span> f163358b0b2b89281d6990e82495d6ca <span class="token key atrule">version_id</span><span class="token punctuation">:</span> UkLM3za5T8oH6.EeZCqOrFNBvUnrAlT7 <span class="token punctuation">-</span> <span class="token key atrule">relpath</span><span class="token punctuation">:</span> 1.txt <span class="token key atrule">md5</span><span class="token punctuation">:</span> ec736fcb3b92886399f3577eac2163bb <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">154</span> <span class="token key atrule">cloud</span><span class="token punctuation">:</span> <span class="token key atrule">myremote</span><span class="token punctuation">:</span> <span class="token key atrule">etag</span><span class="token punctuation">:</span> ec736fcb3b92886399f3577eac2163bb <span class="token key atrule">version_id</span><span class="token punctuation">:</span> fE4Fst2Z25sYEjaJo_0mXZzWDT6vQ4Uz</code></pre></div> <p>This means you need to be more careful about the order in which you <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> and <code>git commit</code>. 
You should first <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> and then <code>git commit</code> since pushing will modify the DVC metafiles. This might seem odd, but it means you have a record in Git of what was pushed, so there is no more guessing whether you remembered to push.</p> <h1 id="quickstart" style="position:relative;">Quickstart<a href="#quickstart" aria-label="quickstart permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>You can start with DVC cloud versioning in 3 steps:</p> <p><strong>1. Check whether cloud versioning is enabled for your bucket/storage account, and enable it if it's not.</strong></p> <ul> <li><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/manage-versioning-examples.html" target="_blank" rel="nofollow noopener noreferrer">Amazon S3</a></li> <li><a href="https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-enable" target="_blank" rel="nofollow noopener noreferrer">Azure Storage</a></li> <li><a href="https://cloud.google.com/storage/docs/using-object-versioning" target="_blank" rel="nofollow noopener noreferrer">Google Cloud Storage</a></li> </ul> <p><strong>2. 
Set up DVC to use that bucket/container as cloud-versioned remote storage.</strong></p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span> </span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">--default</span> myremote s3://cloud-versioned-bucket/path </span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote version_aware <span class="token boolean">true</span></span></code></pre></div> <p><strong>3. Add and then push data.</strong></p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> model.pt predictions </span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div> <hr> <p>Stop messing around with backing up your cloud data! With cloud versioning in DVC, you can iterate on your data as much as you want without losing track of your changes or worrying about your storage growing into an unmanageable mess.</p> <p>Special thanks to <a href="https://github.com/pmrowla" target="_blank" rel="nofollow noopener noreferrer">Peter Rowlands</a> for leading the development of this new capability!</p>https://dvc.org/blog/dvclive-metrics-studiohttps://dvc.org/blog/dvclive-metrics-studioMon, 13 Feb 2023 00:00:00 GMT<p>Computer vision is a complex field requiring much experimentation and trial and error to achieve optimal results. However, managing and tracking the progress of these experiments has not been easy. Once you have sent a job to a server for training, you can no longer see what it is doing. 
Training often runs for days, so it is easy to miss something while keeping an eye on its progress. This makes it difficult to effectively manage your time and reduce unnecessary resource use. Moreover, a team working on the same project needs to be able to easily share their results with colleagues. This can be challenging with existing (or non-existent) tooling.</p> <p>That's where DVCLive and Iterative Studio come in. These tools offer live experiment tracking and efficient result sharing, making it easy to optimize your experimentation process and streamline the workflow with your team.</p> <p><img src="https://dvc.org/2023-02-13/live_plots-d71c91466267bddf7bf4fcd3598eaee6.gif" alt="Real-time experiment tracking in Iterative Studio"> <em>See experiment results in real time in Iterative Studio</em></p> <h3 id="the-tools-at-work" style="position:relative;">The tools at work<a href="#the-tools-at-work" aria-label="the tools at work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> is a Python library, connected to DVC, that provides a real-time experiment logger for tracking the metrics and parameters of machine learning experiments. 
It is beneficial for long-running experiments, which can take hours or even days to complete.</p> <p><a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> is a <a href="https://en.wikipedia.org/wiki/Software_as_a_service" target="_blank" rel="nofollow noopener noreferrer">SaaS</a> platform that displays logged experiments with their metrics, parameters, and plots all tied together and tracked using DVC and Git under the hood. It allows for rich, visual, real-time tracking and sharing of the results, making it easy to collaborate with others and be production-ready efficiently.</p> <p><img src="https://dvc.org/2023-02-13/live_metrics-da579bff70aae9578c94bf2843a92139.gif" alt="Real-time, nested experiment tracking in Iterative Studio"> <em>Real-time, nested experiment tracking in Iterative Studio</em></p> <h3 id="use-case-identifying-and-segmenting-pools-from-satellite-imagery" style="position:relative;">Use case: Identifying and segmenting pools from satellite imagery<a href="#use-case-identifying-and-segmenting-pools-from-satellite-imagery" aria-label="use case identifying and segmenting pools from satellite imagery permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In this computer vision project (see repo <a href="https://github.com/iterative/example-get-started-experiments" target="_blank" rel="nofollow noopener noreferrer">here</a>), we embark on an exciting journey to uncover swimming pools, often obscured from street-level views, right in the middle of our neighborhoods and cities. 
Using <a href="https://www.mathworks.com/help/deeplearning/ref/resnet18.html" target="_blank" rel="nofollow noopener noreferrer">ResNet-18</a> and <a href="https://www.fast.ai/" target="_blank" rel="nofollow noopener noreferrer">Fast.ai</a>, we will be able to accurately identify and segment pools from satellite images.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d2c6e98d28a4eb4add6f657cf363e5a3/39600/bh-pools-dataset.png" alt="BH-Pools Dataset" title="BH-Pools Dataset" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Images and ground truth segmentation of BH-Pools Dataset (<a href="http://patreo.dcc.ufmg.br/2020/07/29/bh-pools-watertanks-datasets/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <admon type="info"> <p>It's worth noting that the experiment in this example is beyond a toy project by design. It may take around one hour to run on an ordinary laptop, and the time may vary depending on the specific configuration and settings. 
However, you can use a GPU to speed up the process.</p> </admon> <h3 id="dataset-methods--tools" style="position:relative;">Dataset, Methods & Tools<a href="#dataset-methods--tools" aria-label="dataset methods tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We will use a modified version of the <a href="http://patreo.dcc.ufmg.br/2020/07/29/bh-pools-watertanks-datasets/" target="_blank" rel="nofollow noopener noreferrer">BH-Pools dataset</a>, which consists of high-resolution 4K images of various neighborhoods in the city of Belo Horizonte, Brazil. These images were captured through Google Earth Pro and come pre-annotated with swimming pools and water tanks. For this project, we will focus on just the swimming pools.</p> <p>We have made the dataset more manageable with some pre-processing to crop the images into smaller tiles of 1024x1024 pixels.</p> <p>When using DVCLive in Iterative Studio, we will be able to see the progress of our experiments. 
Let’s get started!</p> <h3 id="getting-set-up" style="position:relative;">Getting Set up<a href="#getting-set-up" aria-label="getting set up permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Follow the initial setup instructions in the <a href="https://github.com/iterative/example-get-started-experiments" target="_blank" rel="nofollow noopener noreferrer">README</a>. Next, we need to run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> in our root directory to fetch the dataset from our remote. This command retrieves the data from the remote storage and makes it available locally for our experiments. Once the download is complete, we will create a data loader using the label function with <code>SegmentationDataLoaders</code> from the <code>fastai</code> library. This data loader allows us to easily load and preprocess the images (e.g. resizing the images to the desired resolution). 
You can dig deeper into the code <a href="https://github.com/iterative/example-get-started-experiments/blob/main/src/train.py#:~:text=/%20%22train_data%22-,data_loader%20%3D%20SegmentationDataLoaders.from_label_func(,),-model_names%20%3D%20%5B" target="_blank" rel="nofollow noopener noreferrer">here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bcb76ec2e14257ad23b0dd23be082271/39600/swimming-pools-dataset.png" alt="BH-Pools Dataset" title="BH-Pools Dataset" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Sample of Belo Horizonte Pools Dataset from <code>data_loader</code> (<a href="http://patreo.dcc.ufmg.br/2020/07/29/bh-pools-watertanks-datasets/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>After creating the data loader and resizing the images, we train a ResNet-18 model with <code>unet_learner</code>, varying the hyperparameters and using the <code>DVCLiveCallback</code>. The <code>DVCLiveCallback</code> is a built-in logger provided by DVCLive that allows us to track the intermediate results of the training process, such as the loss and accuracy of the model, in real time. 
By logging these metrics, we can easily monitor the progress of our model and make adjustments as needed to optimize the training process and improve the performance of the model.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"> learn <span class="token operator">=</span> unet_learner<span class="token punctuation">(</span> data_loader<span class="token punctuation">,</span> arch<span class="token operator">=</span><span class="token builtin">getattr</span><span class="token punctuation">(</span>models<span class="token punctuation">,</span> params<span class="token punctuation">.</span>train<span class="token punctuation">.</span>arch<span class="token punctuation">)</span><span class="token punctuation">,</span> metrics<span class="token operator">=</span>DiceMulti <span class="token punctuation">)</span> learn<span class="token punctuation">.</span>fine_tune<span class="token punctuation">(</span> <span class="token operator">**</span>params<span class="token punctuation">.</span>train<span class="token punctuation">.</span>fine_tune_args<span class="token punctuation">,</span> cbs<span class="token operator">=</span><span class="token punctuation">[</span>DVCLiveCallback<span class="token punctuation">(</span><span class="token builtin">dir</span><span class="token operator">=</span><span class="token string">"results/train"</span><span class="token punctuation">,</span> report<span class="token operator">=</span><span class="token string">"md"</span><span class="token punctuation">,</span> dvcyaml<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token punctuation">)</span></code></pre></div> <p>Additionally, we can also use Studio to analyze and visualize the results of our experiments, making it easy to share and collaborate with 
others. <a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/live-metrics-and-plots" target="_blank" rel="nofollow noopener noreferrer">By providing the STUDIO_TOKEN</a>, DVCLive will automatically post the results of the experiment to Studio. To do this, first, let’s obtain an individual token from the user profile page in Studio.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/72a87ac1955f69ffe2ff18effb77181c/39600/studio-access-token.png" alt="Generate Iterative Studio Access token" title="Generate Iterative Studio Access token" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Generating Studio Access Token in the Iterative Studio Profile page (<a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>By providing this token as an environment variable, we can access the results of our experiments in an <a href="https://dvc.org/doc/studio/get-started" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio project</a>. 
The project lets you compare them with previous experiments, find insights to improve the model, and share it with others.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/05b43930abaa033ef494a947a7cad639/39600/iterative-studio-live-metrics.png" alt="Comparison in Iterative Studio" title="Comparison in Iterative Studio" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Compare with previous experiments in Iterative Studio (<a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>To export the token, run the command below with the token obtained from your Studio profile:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token builtin class-name">export</span> <span class="token assign-left variable">STUDIO_TOKEN</span><span class="token operator">=</span><span class="token operator"><</span>your-token<span class="token operator">></span></code></pre></div> <p>Running an experiment locally using DVC will now automatically live-update the Studio project(s) associated with your Git remote (the one named "origin").</p> <p>You may want to change the parameters and run the experiment again.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">dvc exp run <span class="token parameter variable">-S</span> <span class="token assign-left variable">train.fine_tune_args.epochs</span><span class="token operator">=</span><span class="token number">16</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">train.img_size</span><span class="token operator">=</span><span class="token number">512</span></code></pre></div> <p><img 
src="https://dvc.org/2023-02-13/exp-run-ed8ebb8ac1c5606d7490f8eeee498460.gif" alt="Experiment tracking in Iterative Studio" title="=800"> <em>Real-time Experiment tracking in Iterative Studio (<a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>As you can see, the change to the epochs and image size brought improvement to the metrics.</p> <p>It's safe to say that if you provide the model with a satellite image of any neighborhood, it will pretty accurately identify all swimming pools in that image! And by using DVCLive and Studio, we were able to track and efficiently control the model training process, without squandering expensive training resources on unfruitful training runs.</p> <h3 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our work has produced a model which is able to accurately identify and segment swimming pools from satellite images! With the help of DVCLive and Iterative Studio, we've been able to visualize results in real-time to make resource-saving decisions. 
And finally, this work is readily visible for the entire team to review!</p> <p>We’d like to express our gratitude to the creators of the incredible <a href="http://patreo.dcc.ufmg.br/about-us/" target="_blank" rel="nofollow noopener noreferrer">BH-Pools dataset</a>, without which there would have been less fun and less impressive results!</p> <p>You can give Iterative Studio a try by signing up <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">here</a>. Try out the <a href="https://github.com/iterative/example-get-started-experiments" target="_blank" rel="nofollow noopener noreferrer">repo</a> or <a href="https://colab.research.google.com/drive/1NTivljRYiySMJn-SHeWQSycBmSOVUbvA" target="_blank" rel="nofollow noopener noreferrer">colab notebook</a> for this project and let us know what you think in <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> or <a href="https://discuss.dvc.org/t/track-computer-vision-experiments-in-real-time-with-dvclive-in-iterative-studio/1478" target="_blank" rel="nofollow noopener noreferrer">Discourse</a>!</p> <admon type="info"> <p>Learn more about enhancing your machine learning experimentation with these blog posts:</p> <ul> <li><a href="https://iterative.ai/blog/exp-tracking-dvc-python" target="_blank" rel="nofollow noopener noreferrer">Experiment Tracking with DVC and Python</a></li> <li><a href="https://iterative.ai/blog/dvc-hydra-integration/" target="_blank" rel="nofollow noopener noreferrer">DVC and Hydra Integration</a>.</li> </ul> </admon> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! 
Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/mlem-nanogpt-modal-flyiohttps://dvc.org/blog/mlem-nanogpt-modal-flyioWed, 08 Feb 2023 00:00:00 GMT<h2 id="preparing-data" style="position:relative;">Preparing data<a href="#preparing-data" aria-label="preparing data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To kick off the process, you basically just need a single text file that you want your model to be trained on. For example, I often struggle with writing docs for MLEM framework, so I will try to generate those. <a href="https://github.com/mike0sv/nanoGPT/blob/mlem/data/mlem-docs/prepare.py" target="_blank" rel="nofollow noopener noreferrer">Here</a> you can find my code that clones <a href="https://github.com/iterative/mlem.ai" target="_blank" rel="nofollow noopener noreferrer">mlem.ai repo</a>, compiles every <code>.md</code> from the docs directory into a single text file and then creates a train set using the same code as an example Shakespeare dataset. 
I also prepended each file’s content with the path to that file, so I can condition the generation on a specific file.</p> <p>Of course, for your own experiments, you can provide different data and train a GPT model for a different task.</p> <h2 id="training-the-model" style="position:relative;">Training the model<a href="#training-the-model" aria-label="training the model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Thanks to Andrej’s original repo, it’s as easy as cloning and running a couple of commands. My fork has some additional stuff to make it even easier.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">git</span> clone https://github.com/mike0sv/nanoGPT <span class="token operator">&&</span> <span class="token builtin class-name">cd</span> nanoGPT/ <span class="token operator">&&</span> <span class="token function">git</span> checkout <span class="token parameter variable">-b</span> mlem origin/mlem $ pip <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements-mlem.txt <span class="token comment"># Prepare mlem docs dataset</span> <span class="token comment"># Alternatively, you can compile your own training data for a different task</span> $ python data/mlem-docs/prepare.py char</code></pre></div> <p>If you don’t have access to a GPU, you can use <a href="http://modal.com" target="_blank" rel="nofollow noopener noreferrer">modal.com</a> to train your model without any infrastructure configuration. 
Just register there, wait for approval, and run <a href="https://github.com/mike0sv/nanoGPT/blob/mlem/modal_train.py" target="_blank" rel="nofollow noopener noreferrer">this script</a> to start the training and download the resulting model checkpoint.</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ modal token new <span class="token comment"># approve in browser</span> $ python modal_train.py <span class="token comment"># you can edit paths or other parameters</span></code></pre></div> <p>Or, if you are already working on a machine with a GPU, just run the training locally:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># train model</span> $ python train.py config/train_mlemai.py <span class="token parameter variable">--device</span> cuda <span class="token parameter variable">--dtype</span><span class="token operator">=</span>float32 <span class="token parameter variable">--max_iters</span><span class="token operator">=</span><span class="token number">3000</span> <span class="token parameter variable">--init_from</span><span class="token operator">=</span>scratch</code></pre></div> <p>After training, your model will be saved at <code>out-mlemai-char/ckpt.pt</code> and you can sample from it with:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># sample model</span> $ python sample.py <span class="token parameter variable">--out_dir</span><span class="token operator">=</span>out-mlemai-char <span class="token parameter variable">--dtype</span><span class="token operator">=</span>float32</code></pre></div> <h2 id="deploying-your-model" style="position:relative;">Deploying your model<a href="#deploying-your-model" aria-label="deploying your model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path 
fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now, to show off your model to friends and colleagues, we will deploy it as a <a href="https://streamlit.io" target="_blank" rel="nofollow noopener noreferrer">Streamlit</a> application to <a href="https://fly.io" target="_blank" rel="nofollow noopener noreferrer">https://fly.io</a>. It’s very easy with the <a href="https://mlem.ai" target="_blank" rel="nofollow noopener noreferrer">MLEM</a> Streamlit extension. First, we need to save the model as an MLEM model - <a href="https://github.com/mike0sv/nanoGPT/blob/mlem/wrapper.py" target="_blank" rel="nofollow noopener noreferrer">here</a> is the script for that:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ python wrapper.py out-mlemai-char mlem_char</code></pre></div> <p>Now, set up and log in to <a href="https://fly.io/docs/hands-on/install-flyctl/" target="_blank" rel="nofollow noopener noreferrer">fly.io</a> and run the <code>mlem deploy</code> command. 
I also prepared a <a href="https://github.com/mike0sv/nanoGPT/blob/mlem/app.py" target="_blank" rel="nofollow noopener noreferrer">custom Streamlit application template</a> you can use to give it a more ChatGPT-like feel:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"> <span class="token comment"># setup flyio</span> $ flyctl auth login $ mlem deploy run flyio app <span class="token parameter variable">-m</span> mlem_char <span class="token punctuation">\</span> <span class="token parameter variable">--app_name</span> mlem-nanogpt <span class="token parameter variable">--scale_memory</span> <span class="token number">1024</span> <span class="token punctuation">\</span> <span class="token parameter variable">--server</span> streamlit <span class="token parameter variable">--server.ui_port</span> <span class="token number">8080</span> <span class="token punctuation">\</span> <span class="token parameter variable">--server.server_port</span> <span class="token number">8081</span> <span class="token parameter variable">--server.template</span> app.py</code></pre></div> <p>After the command finishes, just go to https://&lt;app_name&gt;.fly.dev - in my case it’s <a href="https://mlem-nanogpt.fly.dev/" target="_blank" rel="nofollow noopener noreferrer">https://mlem-nanogpt.fly.dev/</a> - and start chatting.</p> <p><img src="https://dvc.org/2023-02-08/app-03314cb2a611e772a98a57b05f8e5a77.gif" alt="app.gif"></p> <p>Well, I guess if this is what generated docs look like, I still have a job! 
🤣</p> <p>But just for lulz, I re-generated the whole MLEM documentation with this model - you can check it out <a href="https://mlem-ai-nano-gpt-xyinoh8xgobdz.herokuapp.com/doc" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Nowadays it’s really easy to recreate someone else’s work thanks to open source software. And thanks to folks like Andrej and companies like Modal and Fly, it is now much faster to build and deploy ML models. We are happy to be part of this, with tools like MLEM, DVC, CML and others. Long live open source!</p>https://dvc.org/blog/mlem-cv-model-deploymenthttps://dvc.org/blog/mlem-cv-model-deploymentThu, 19 Jan 2023 00:00:00 GMT<p>By developing MLEM - a tool that allows researchers to easily deploy their models to production without having to worry about the underlying infrastructure - we strive to help them focus on what they do best: developing and improving their models. This can help accelerate the pace of research and development, and ultimately lead to better and more effective AI systems.</p> <p>MLEM deploys your models in a couple of commands - and in this blog post, we’ll deploy an image classification model to <a href="https://fly.io" target="_blank" rel="nofollow noopener noreferrer">Fly.io</a>. Without any additional user input, MLEM will serve your model with a REST API, create a Streamlit application, and build a Docker image with both included. 
Does this sound like fun? Try out the deployment at <a href="https://mlem-cv.fly.dev" target="_blank" rel="nofollow noopener noreferrer">https://mlem-cv.fly.dev</a> before we start!</p> <h2 id="the-good-part" style="position:relative;">The good part<a href="#the-good-part" aria-label="the good part permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To showcase MLEM’s power, we’ll take a PyTorch model and deploy it to the cloud in a couple of simple steps. Just don’t forget to install MLEM and other requirements with <code>pip install torch torchvision mlem[streamlit,flyio]</code>. You’ll also need Docker up and running on your machine.</p> <p>First, we need to get the model. 
To get to model deployment faster, we won’t dive too far into model development and will stick to the task at hand by using a pre-trained ResNet model from <code>torchvision</code>:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> torchvision<span class="token punctuation">.</span>models <span class="token keyword">import</span> ResNet50_Weights<span class="token punctuation">,</span> resnet50 weights <span class="token operator">=</span> ResNet50_Weights<span class="token punctuation">.</span>DEFAULT model <span class="token operator">=</span> resnet50<span class="token punctuation">(</span>weights<span class="token operator">=</span>weights<span class="token punctuation">)</span> model<span class="token punctuation">.</span><span class="token builtin">eval</span><span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> <p>Since our model expects tensors of a certain shape, we need some preprocessing to be able to use it with an arbitrary image. And while we’re here, let’s throw some postprocessing on top to get the class name from the predicted class probabilities. 
Thankfully, MLEM allows you to do just that:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> torchvision<span class="token punctuation">.</span>io <span class="token keyword">import</span> read_image <span class="token keyword">from</span> mlem<span class="token punctuation">.</span>api <span class="token keyword">import</span> save img <span class="token operator">=</span> read_image<span class="token punctuation">(</span><span class="token string">"cat.jpg"</span><span class="token punctuation">)</span> categories <span class="token operator">=</span> weights<span class="token punctuation">.</span>meta<span class="token punctuation">[</span><span class="token string">"categories"</span><span class="token punctuation">]</span> preprocess <span class="token operator">=</span> weights<span class="token punctuation">.</span>transforms<span class="token punctuation">(</span><span class="token punctuation">)</span> save<span class="token punctuation">(</span>model<span class="token punctuation">,</span> <span class="token string">"torch_resnet"</span><span class="token punctuation">,</span> preprocess<span class="token operator">=</span><span class="token keyword">lambda</span> x<span class="token punctuation">:</span> preprocess<span class="token punctuation">(</span>x<span class="token punctuation">)</span><span class="token punctuation">.</span>unsqueeze<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> postprocess<span class="token operator">=</span><span class="token keyword">lambda</span> x<span class="token punctuation">:</span> categories<span class="token punctuation">[</span> x<span class="token punctuation">.</span>squeeze<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span 
class="token punctuation">.</span>softmax<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">.</span>argmax<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">]</span><span class="token punctuation">,</span> sample_data<span class="token operator">=</span>img<span class="token punctuation">,</span> <span class="token punctuation">)</span></code></pre></div> <p>MLEM will do its metadata-extracting magic on our model, so we get a ready-to-serve MLEM model at the <code>torch_resnet</code> path.</p> <p>Now we’re ready for deployment, but first we’d like to play around with it locally. We can use <a href="https://mlem.ai/doc/command-reference/serve" target="_blank" rel="nofollow noopener noreferrer"><code>mlem serve</code></a> to see how it works:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ mlem serve streamlit <span class="token punctuation">\</span> <span class="token parameter variable">--model</span> torch_resnet <span class="token punctuation">\</span> <span class="token parameter variable">--request_serializer</span> torch_image <span class="token comment"># accept images instead of raw tensors</span> Starting streamlit server<span class="token punctuation">..</span>. 🖇️ Adding route <span class="token keyword">for</span> /predict Checkout openapi docs at <span class="token operator"><</span>http://0.0.0.0:8080/docs<span class="token operator">></span> INFO: Started server process <span class="token punctuation">[</span><span class="token number">17525</span><span class="token punctuation">]</span> INFO: Waiting <span class="token keyword">for</span> application startup. INFO: Application startup complete. 
INFO: Uvicorn running on http://0.0.0.0:8080 <span class="token punctuation">(</span>Press CTRL+C to quit<span class="token punctuation">)</span> You can now view your Streamlit app <span class="token keyword">in</span> your browser. URL: http://0.0.0.0:80</code></pre></div> <p>Let's head over to <a href="http://localhost:80" target="_blank" rel="nofollow noopener noreferrer">localhost:80</a> to see if our model is ready for production!</p> <p><img src="https://dvc.org/2023-01-19/streamlit-1fd30393f4cbab125953036101ec878f.gif" alt="Streamlit app"></p> <p>This is already useful: you can play around with your model, demo it to colleagues in a call, or show your pet how it's going to be classified now. There are tons of ways to use this - give it a try the next time you need it!</p> <h2 id="cloudification" style="position:relative;">Cloudification<a href="#cloudification" aria-label="cloudification permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>That's cool and all, but what is your model worth if you need to call your friends each time to show it off? MLEM can help in this department too. 
<a href="https://mlem.ai/doc/command-reference/deployment" target="_blank" rel="nofollow noopener noreferrer">Using <code>mlem deploy</code></a> you can deploy your model to Heroku, Sagemaker, Kubernetes or Flyio (not to mention <a href="https://mlem.ai/doc/command-reference/build" target="_blank" rel="nofollow noopener noreferrer"><code>mlem build</code></a>, which can build a Docker image out of your model that you can later deploy yourself).</p> <p>Since a PR for <a href="http://fly.io" target="_blank" rel="nofollow noopener noreferrer">fly.io</a> was just merged, let’s use it:</p> <ul> <li>Go to <a href="http://fly.io" target="_blank" rel="nofollow noopener noreferrer">fly.io</a> and set up an account</li> <li>Install flyctl using <a href="https://fly.io/docs/hands-on/install-flyctl/" target="_blank" rel="nofollow noopener noreferrer">these instructions</a></li> <li>Log in via <code>flyctl auth login</code></li> <li>You also need to provide a credit card, but they won't charge you <a href="https://fly.io/docs/about/pricing/#how-it-works" target="_blank" rel="nofollow noopener noreferrer">until you exceed free limits</a>.</li> </ul> <p>Normally, we’d need to write a <code>Dockerfile</code>, a <code>requirements.txt</code> and other deployment-platform-specific files like a <code>Procfile</code>, and then finally use the <code>flyctl</code> executable to run an app. 
But fortunately, we can just run:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ mlem deploy run flyio cv-app <span class="token punctuation">\</span> <span class="token parameter variable">--model</span> torch_resnet <span class="token punctuation">\</span> <span class="token parameter variable">--app_name</span> mlem-cv <span class="token punctuation">\</span> <span class="token parameter variable">--scale_memory</span> <span class="token number">1024</span> <span class="token punctuation">\</span> <span class="token parameter variable">--server</span> streamlit <span class="token punctuation">\</span> <span class="token parameter variable">--server.request_serializer</span> torch_image <span class="token punctuation">\</span> <span class="token parameter variable">--server.ui_port</span> <span class="token number">8080</span> <span class="token punctuation">\</span> <span class="token parameter variable">--server.server_port</span> <span class="token number">8081</span></code></pre></div> <p>Now it’s live at <a href="https://mlem-cv.fly.dev" target="_blank" rel="nofollow noopener noreferrer">mlem-cv.fly.dev</a> 🚀</p> <p>Finally, all you have to do now is to brag to your best friend about your achievement:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7c92549a130c055a72fbe0829ae7cf58/39600/best-friend.png" alt="ChatGPT" title="ChatGPT" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h2 id="whats-next" style="position:relative;">What's next?<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As we promised in our <a href="https://iterative.ai/blog/mlem-k8s-sagemaker/" target="_blank" rel="nofollow noopener noreferrer">last MLEM blog post</a>, we added support for CV models and models that have preprocessing or postprocessing steps. What's next?</p> <ul> <li>We're looking at integrations with specialized CV serving tools like TorchServe, GPU support, and model optimization.</li> <li>We already <a href="https://medium.com/better-programming/i-trained-a-model-to-tell-if-you-were-naughty-this-year-11a36ca6d472" target="_blank" rel="nofollow noopener noreferrer">support NLP scenarios</a>, but we're going to see if there is something special that needs to be implemented there as well.</li> </ul> <p>Feel free to drop us a line in <a href="https://github.com/iterative/mlem/issues" target="_blank" rel="nofollow noopener noreferrer">GH issues</a> if you'd like something specific! See you next time 🐶</p>https://dvc.org/blog/january-2023-heartbeathttps://dvc.org/blog/january-2023-heartbeatTue, 17 Jan 2023 00:00:00 GMT<p>Happy New Year! We are looking forward to what’s going to be a stellar year for us and for all of you! We are hoping for peace to reign, the recession to subside, and success aplenty. 🤞🏼 Are you ready? 
Let’s do this!</p> <p><img src="https://media.giphy.com/media/JykvbWfXtAHSM/giphy.gif" alt="Lets Do This GIF by National Geographic Channel"></p> <h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>We always start with DVC, but this month, in this new year, we’ll start with MLEM! We released MLEM in June of last year and have made <a href="https://iterative.ai/blog/mlem-k8s-sagemaker" target="_blank" rel="nofollow noopener noreferrer">some advances to it already</a>. It seems the Community is learning about it and recognizing its benefits. 
We are thrilled to see that!</p> <h2 id="mlem-tutorial-video-from-jcharis-jesse" style="position:relative;">MLEM Tutorial Video from JCharis Jesse<a href="#mlem-tutorial-video-from-jcharis-jesse" aria-label="mlem tutorial video from jcharis jesse permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/JCharisTech" target="_blank" rel="nofollow noopener noreferrer"><strong>JCharis Jesse</strong></a> created the <a href="https://www.youtube.com/watch?v=vEoc64xJaK4" target="_blank" rel="nofollow noopener noreferrer">FIRST video tutorial from the Community for MLEM!</a> In this very well-explained and recorded video, Jesse takes you through what MLEM is and where it fits in the machine learning to production process. He follows that by showing the different options of saving a model, where to find the model metadata and how it works, loading the ML model, examples of serving with FastAPI and Docker, and finally applying the model to data for prediction. If you are interested in using MLEM for serving your models, this will definitely help get you started! 
You can find a ton of other great content on his <a href="https://www.youtube.com/@JCharisTech" target="_blank" rel="nofollow noopener noreferrer">YouTube site</a>.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/vEoc64xJaK4?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="tryolabs-top-python-libraries-of-2022" style="position:relative;">Tryolabs Top Python Libraries of 2022<a href="#tryolabs-top-python-libraries-of-2022" aria-label="tryolabs top python libraries of 2022 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>From our friends at <a href="https://tryolabs.com/" target="_blank" rel="nofollow noopener noreferrer">Tryolabs</a>, <a href="https://www.linkedin.com/in/alan-descoins/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alan Descoins</strong></a> and <a href="https://www.linkedin.com/in/facundo-lezama/" target="_blank" rel="nofollow noopener noreferrer"><strong>Facundo Lezama</strong></a> round out 2022 with <a href="https://tryolabs.com/blog/2022/12/26/top-python-libraries-2022" 
target="_blank" rel="nofollow noopener noreferrer">Tryolabs’ annual picks for the best Python Libraries of 2022</a>. To make the cut, a library must have been launched or gained popularity within the year. They have a list of top 10 picks that you will want to take a look at, including <a href="https://lineapy.org/" target="_blank" rel="nofollow noopener noreferrer">LineaPy</a>, which helps you convert notebooks to production pipelines. MLEM also made the list in the category of <em>Tools & Enablers</em>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fb378179100dbdc49e9db4e80afeb3ac/39600/tryolabs.png" alt="Tryolabs" title="Tryolabs" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Tryolabs Best Python Libraries of 2022 (<a href="https://tryolabs.com/blog/2022/12/26/top-python-libraries-2022" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="bex-tuychiev---data-version-control-learn-what-other-data-scientists-are-ignoring" style="position:relative;">Bex Tuychiev - Data Version Control: Learn What Other Data Scientists Are Ignoring<a href="#bex-tuychiev---data-version-control-learn-what-other-data-scientists-are-ignoring" aria-label="bex tuychiev data version control learn what other data scientists are ignoring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper 
image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b9d24af1b8f8422fd44012394ef91049/03346/fiona-art.jpg" alt="Learn What Other Data Scientists are Ignoring with DVC" title="Photo by Fiona Art from Pexels" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> In the first part of a new series on DVC, <a href="https://www.linkedin.com/in/bextuychiev/" target="_blank" rel="nofollow noopener noreferrer"><strong>Bex Tuychiev</strong></a> writes a fire 🔥 tutorial on DVC in <a href="https://towardsdatascience.com/how-to-version-gigabyte-sized-datasets-just-like-code-with-dvc-in-python-5197662e85bd" target="_blank" rel="nofollow noopener noreferrer">Towards Data Science</a> with a computer vision project using the German Traffic Sign Recognition Benchmark Dataset and Tensorflow. He guides you on getting the project properly set up, then how to start adding, tracking, pulling, and pushing files with DVC. Next, he goes over building the image classification model and then concludes with how to create a shared cache if you are working on a large project with a team. Reproducibility and Collaboration for the win! 
We are looking forward to the next parts of the series!</p> <p><img src="https://media.giphy.com/media/epxDzItQhxAzK/giphy.gif" alt="It Crowd Popcorn GIF"></p> <h2 id="aryan-jadon---survey-of-data-versioning-tools-for-machine-learning-operations" style="position:relative;">Aryan Jadon - Survey of Data Versioning Tools for Machine Learning Operations<a href="#aryan-jadon---survey-of-data-versioning-tools-for-machine-learning-operations" aria-label="aryan jadon survey of data versioning tools for machine learning operations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>For a very nice comparison of Data Versioning Tools, look to <a href="https://www.linkedin.com/in/aryan-jadon/" target="_blank" rel="nofollow noopener noreferrer"><strong>Aryan Jadon’s</strong></a> <a href="https://medium.com/@aryanjadon/analysis-of-data-versioning-tools-for-machine-learning-operations-1cb27146ce49" target="_blank" rel="nofollow noopener noreferrer">recent post on the subject</a>. He seems to hit them all, providing information about their benefits and things of which to be cautious. Naturally, DVC makes this list with the only caution being, “you need to use a Git repository to use DVC’s versioning features.” Isn’t Git a part of every modern tech stack? 
😉 Staying true to our mission to deliver the best developer experience for machine learning teams by creating an ecosystem of open, modular ML tools!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ab7a7b82f3683b956095a8b0e40529eb/39600/aryan-jadon.png" alt="Survey of Data Versioning Tools for Machine Learning Operations" title="Survey of Data Versioning Tools for Machine Learning Operations" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Deciding on Data Versioning Tools? (<a href="https://medium.com/@aryanjadon/analysis-of-data-versioning-tools-for-machine-learning-operations-1cb27146ce49" target="_blank" rel="nofollow noopener noreferrer">Source link by Mary Amato </a>)</em></p> <h2 id="sami-jawhar---running-parallel-pipelines-with-dvc-and-tpi" style="position:relative;">Sami Jawhar - Running Parallel Pipelines with DVC and TPI<a href="#sami-jawhar---running-parallel-pipelines-with-dvc-and-tpi" aria-label="sami jawhar running parallel pipelines with dvc and tpi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you couldn’t make the December Meetup, good news! <a href="https://youtu.be/X3M1UfMn2Kk" target="_blank" rel="nofollow noopener noreferrer">The video</a> is already out! 
<a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sami Jawhar</strong></a> joined us to share a solution he built to run parallel pipelines with DVC and TPI to save time processing the massive amount of data they use in their brain research at <a href="https://www.kernel.com/" target="_blank" rel="nofollow noopener noreferrer">Kernel</a>. He describes the context of his situation as well as all of its constraints and finally the details of the solution, coined “Neuromancer” after the famous sci-fi novel. Get ready for some mind-blowing engineering! 🤯</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/X3M1UfMn2Kk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="mlem-christmas-project" style="position:relative;">MLEM Christmas Project<a href="#mlem-christmas-project" aria-label="mlem christmas project 
permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><img src="https://media.giphy.com/media/KtrhyNGwNCSYM4pVRq/giphy.gif" alt="Have you been Naughty or Nice?" title="Naughty or Nice MLEMMing" style="width: 300px; float: right; clear: left; padding: 0.5rem"> In case you missed it while you were out for the holidays, <a href="https://www.linkedin.com/in/1aguschin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Guschin</strong></a> and <a href="https://www.linkedin.com/in/mike0sv/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mike Sveshnikov</strong></a>, your friendly neighborhood MLEM creators, put together <a href="https://medium.com/@mike0sv/i-trained-a-model-to-tell-if-you-were-naughty-this-year-11a36ca6d472" target="_blank" rel="nofollow noopener noreferrer">a fun project using MLEM</a> to determine if you had been naughty or nice just ahead of Santa’s trot around the globe in 2022. In the blog post, you will learn how they DDOS’ed Santa’s website, trained a Christmas (decision) tree, and deployed an ML service with MLEM to <a href="https://streamlit.io/" target="_blank" rel="nofollow noopener noreferrer">Streamlit</a> to see the predictions.</p> <p>You can try it out <a href="https://mlem-nice-or-naughty.fly.dev/" target="_blank" rel="nofollow noopener noreferrer">here</a>. 
And check out how some of our team members fared in <a href="https://www.linkedin.com/posts/1aguschin_streamlit-activity-7012056418816036864-k9hv?utm_source=share&utm_medium=member_desktop" target="_blank" rel="nofollow noopener noreferrer">this LinkedIn post</a>. Spoiler alert: I’m naughty and nice?</p> <h2 id="casper-da-costa-luis-at-mlops-summit---painless-cloud-experiments-without-leaving-your-ide" style="position:relative;">Casper da Costa-Luis at MLOps Summit - Painless cloud experiments without leaving your IDE<a href="#casper-da-costa-luis-at-mlops-summit---painless-cloud-experiments-without-leaving-your-ide" aria-label="casper da costa luis at mlops summit painless cloud experiments without leaving your ide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our CML Product Manager, <a href="https://github.com/casperdcl" target="_blank" rel="nofollow noopener noreferrer"><strong>Casper da Costa-Luis</strong></a>, presented in November at MLOps Summit on <em>Painless cloud experiments without leaving your IDE</em>. The presentation is now available on YouTube <a href="https://www.youtube.com/watch?v=PaBQF89URuI" target="_blank" rel="nofollow noopener noreferrer">here</a>. If full lifecycle management of computing resources (including GPUs and auto-respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s), without needing to be a cloud expert, appeals to you, this talk is for you! 
He discusses how to move experiments seamlessly between a local laptop, a powerful cloud machine, and your CI/CD of choice.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/PaBQF89URuI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="new-unstructured-data-query-language" style="position:relative;">New Unstructured Data Query Language<a href="#new-unstructured-data-query-language" aria-label="new unstructured data query language permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><strong>Do you use Amazon S3, Azure Blob Storage, or Google Cloud Storage? We have a new solution for finding and managing your datasets of unstructured data like images, audio files, and PDFs!</strong> Extend your DVC environment with the first unstructured data query language (think SQL -> DQL) for machine learning. 
We are looking for beta customers for this new tool.</p> <p><a href="https://calendly.com/gtm-2/iterative-datamgmt-overview" target="_blank" rel="nofollow noopener noreferrer">Schedule a meeting with us</a> if that's what you're needing! Find more info <a href="https://iterative.ai/data-catalog-for-ml" target="_blank" rel="nofollow noopener noreferrer">here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2edbe3221d465dd67d0de2903ebd6c73/39600/dvc-cloud.png" alt="Unstructured Data Query Language from the makers of DVC" title="Unstructured Data Query Language from the makers of DVC" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Unstructured Data Query Language Prototype</em></p> <h2 id="-doc-updates" style="position:relative;">✍🏼 Doc Updates!<a href="#-doc-updates" aria-label=" doc updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 
13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our favorite Tweet this month is from <a href="https://twitter.com/the_osbm" target="_blank" rel="nofollow noopener noreferrer"><strong>Osman Bayram</strong></a> who mentions he plans to use CML with <a href="https://huggingface.co/" target="_blank" rel="nofollow noopener noreferrer">Huggingface</a> GPU. We are looking forward to that! 🍿 I'm seeing a lot of popcorn eating in our future. See you next month!</p> <p><a href="https://twitter.com/the_osbm/status/1606018332175478786?s=20&t=uTKIsTjTv5frJPz2yNPqUw" target="_blank" rel="nofollow noopener noreferrer">Link to Tweet</a></p> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/december-2022-heartbeathttps://dvc.org/blog/december-2022-heartbeatFri, 16 Dec 2022 00:00:00 GMT<admon type="tip"> <p>Unlike most of the text you've read over the past two weeks, this Heartbeat was 100% human generated. 😉</p> </admon> <p>Welcome to December! Wow, what a year! We introduced an online course, added five new tools (TPI, GTO, MLEM, DVC Extension for VS Code, and a Model Registry in Iterative Studio) plus tons of new features to DVC, CML, and Iterative Studio. 
We also were thrilled to emerge from the pandemic and meet so many of you in person at conferences around the world. We are excited about what's in store for 2023, and we thank you all for being such fantastic community members. While there are still challenging events happening around the globe, there is much to be thankful for and victories to celebrate! Bring on 2023!</p> <p><img src="https://media.giphy.com/media/DEZA7FlHbMesUF1jm9/giphy.gif" alt="Believe Jason Sudeikis GIF by Apple TV"></p> <h2 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="mlops-guide" style="position:relative;">MLOps Guide<a href="#mlops-guide" aria-label="mlops guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>For their engineering final project at <a href="https://www.insper.edu.br/en/" target="_blank" rel="nofollow noopener noreferrer">Insper,</a> <a href="https://github.com/arthurolga" target="_blank" rel="nofollow noopener noreferrer"><strong>Arthur Olga</strong></a>, <a href="https://github.com/gabriellm1" 
target="_blank" rel="nofollow noopener noreferrer"><strong>Gabriel Monteiro</strong></a>, <a href="https://github.com/guipleite" target="_blank" rel="nofollow noopener noreferrer"><strong>Guilherme Leite</strong></a>, and <a href="https://github.com/ViniGl" target="_blank" rel="nofollow noopener noreferrer"><strong>Vinicius Lima</strong></a> created the <a href="https://mlops-guide.github.io/" target="_blank" rel="nofollow noopener noreferrer">MLOps Guide</a>, which provides a Complete MLOps development cycle using DVC, CML, and IBM Watson. The multi-page guide covers the principles of MLOps as well as a full tutorial for building an MLOps environment. It covers data and model versioning, feature management and storing, automation of pipelines and processes, CI/CD for machine learning, and continuous monitoring of models. The guide uses both DVC and CML and includes videos outlining the project and much of the coding, as well as a project repository that you can work through.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e19b634d5bfb74525bd6fcfebea425b5/39600/DiagramMLOPs.png" alt="MLOps Guide" title="MLOps Guide" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>MLOps Guide (<a href="https://mlops-guide.github.io/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="turn-vs-code-into-a-one-stop-shop-for-ml-experiments" style="position:relative;">Turn VS Code Into a One-Stop Shop for ML Experiments<a href="#turn-vs-code-into-a-one-stop-shop-for-ml-experiments" aria-label="turn vs code into a one stop shop for ml experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/eryklewinson/" target="_blank" rel="nofollow noopener noreferrer"><strong>Eryk Lewinson</strong></a> wrote a fabulous, <a href="https://towardsdatascience.com/turn-vs-code-into-a-one-stop-shop-for-ml-experiments-49c97c47db27" target="_blank" rel="nofollow noopener noreferrer">in-depth tutorial</a> on experiment tracking using our new DVC Extension for VS Code. He starts off with, “One of the biggest threats to productivity in recent times is context switching.” As a Community Manager, I can so relate! 😅 He posits that the extension is a great way to both code our experiments and evaluate and compare them happily in our IDE, without having to jump back and forth between platforms.</p> <p><img src="https://dvc.org/2022-12-16/eryk-lewinson-81e150bc16515d76971e4dfdaa417938.gif" alt="DVC Extension for VS Code Experiment Tracking"></p> <p>Eryk uses a credit card risk dataset and project to show most of the capabilities of the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a> and take us through all the steps to show the entire workflow and the resulting project structure. He notes the best points of the extension are its experiment bookkeeping with an emphasis on reproducibility and its extended plotting capabilities including live plotting to visualize model performance while the model is still being trained. 
He goes over some tricks and functionality of the extension as well.</p> <h3 id="a-fable-about-mlopsand-broken-dreams" style="position:relative;">A Fable About MLOps…and Broken Dreams<a href="#a-fable-about-mlopsand-broken-dreams" aria-label="a fable about mlopsand broken dreams permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d4c9a14e797ce0242da41c27d281855f/39600/alex-burlacu.png" alt="A Fable About MLOps...And Broken Dreams" title="A Fable About MLOps...And Broken Dreams" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>A Fable About MLOps… and Broken Dreams (<a href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p><a href="https://www.linkedin.com/in/alexandru-burlacu" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Burlacu</strong></a> tells a great story and provides many tips on his experience in MLOps <a href="https://alexandruburlacu.github.io/posts/2022-11-22-mlops-fable" target="_blank" rel="nofollow noopener noreferrer">in this piece</a> on his blog called <em>A Fable About MLOps… and Broken Dreams</em>. The tale is likely all too familiar to many of you in our Community in addition to being validating and entertaining to read. 
He offers some great prerequisites for beginning your MLOps journey, including quickly finding and accessing your data, seeding your model training code, and recording your experiment configuration. For the last of these he recommends MLflow, but as the previous summary from Eryk points out, this can be done very effectively with the new DVC extension AND be truly fully reproducible. 🤗</p> <p>Generally, he recommends starting early and starting small with MLOps. More technically, he recommends a simple data collection and discovery system, data versioning with DVC, replicable experiments, experiment tracking, ML serving, testing, and CI/CD. It's all great advice and fun to read!</p> <h3 id="ml-pipeline-decoupled---i-managed-to-write-a-framework-agnostic-ml-pipeline-with-dvc-rust-and-python" style="position:relative;">ML Pipeline Decoupled - I managed to write a framework-agnostic ml pipeline with DVC, Rust, and Python<a href="#ml-pipeline-decoupled---i-managed-to-write-a-framework-agnostic-ml-pipeline-with-dvc-rust-and-python" aria-label="ml pipeline decoupled i managed to write a framework agnostic ml pipeline with dvc rust and python permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f2878d680f3bed6dd6fb5751e56ff333/39600/mr-data-psycho.png" alt="Framework Agnostic ML Pipeline with DVC, Rust and Python" title="Framework Agnostic ML Pipeline with DVC, Rust and Python" loading="auto" 
decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="https://www.linkedin.com/in/mr-data-psycho/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sheikh Samsuzzhan Alam, aka Mr. Data Psycho</strong>,</a> writes <a href="https://towardsdev.com/ml-pipeline-decoupled-i-managed-to-write-a-framework-agnostic-ml-pipeline-with-dvc-rust-python-287de68104c9" target="_blank" rel="nofollow noopener noreferrer">this great piece</a> that reminds us that DVC is language agnostic! While Python is the most popular language used in Data Science and with DVC, there are some instances where you may want to use a language such as Rust to improve memory efficiency and speed up parts of your project. The good news is you can! Mr. Data Psycho extols the virtues of DVC’s pipelining feature and shows how to use Rust (Polars) for pre-processing and scikit-learn for model training, with the rest in Python. Because the stages are defined in YAML files, each stage can run code written in whatever language your heart desires! You can find the repo for the project <a href="https://github.com/DataPsycho/mlpipeline-with-dvc" target="_blank" rel="nofollow noopener noreferrer">here</a>. 
R users may be interested in this related content <a href="https://github.com/jcpsantiago/dvthis" target="_blank" rel="nofollow noopener noreferrer">here</a>, <a href="https://www.youtube.com/watch?v=NwUijrm2U2w&t=2s" target="_blank" rel="nofollow noopener noreferrer">here,</a> and <a href="https://iterative.ai/blog/r-code-and-reproducible-model-development-with-dvc" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="digital-cheatsheet-for-dvc" style="position:relative;">Digital Cheatsheet for DVC<a href="#digital-cheatsheet-for-dvc" aria-label="digital cheatsheet for dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you’d like an online CheatSheet for DVC you can find one <a href="https://cheat.sh/dvc" target="_blank" rel="nofollow noopener noreferrer">here</a> created by <a href="https://twitter.com/igor_chubin" target="_blank" rel="nofollow noopener noreferrer"><strong>Igor Chubin</strong></a>. Pick a command from the drop-down menu and bam 💥, you’ve got the info you need! 
It’s very cool, but do always remember to check our docs <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">here</a>, <a href="https://cml.dev/doc" target="_blank" rel="nofollow noopener noreferrer">here</a>, and <a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">here</a>; we are always updating them!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/14e259393d78eccef563d67bece560f2/39600/cheatsheet.png" alt="DVC Cheat sheet" title="DVC Cheat sheet" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC Cheat Sheet (<a href="https://cheat.sh/dvc" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="akvelon-enables-non-python-apps-to-integrate-machine-learning-models-with-mlem" style="position:relative;">Akvelon enables non-Python apps to integrate machine learning models with MLEM<a href="#akvelon-enables-non-python-apps-to-integrate-machine-learning-models-with-mlem" aria-label="akvelon enables non python apps to integrate machine learning models with mlem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/aleksandr-dudko-bb475476/" target="_blank" rel="nofollow noopener noreferrer"><strong>Aleksandr Dudko</strong></a>, <a href="https://www.linkedin.com/in/anatolii-bolshakov-9a25b2199/" target="_blank" rel="nofollow noopener 
noreferrer"><strong>Anatoly Bolshakov</strong></a>, <a href="https://www.linkedin.com/in/denis-nosov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Denis Nosov</strong></a>, and <a href="https://www.linkedin.com/in/vladimir-krestov-4873391ba/" target="_blank" rel="nofollow noopener noreferrer"><strong>Vladimir Krestov</strong></a>, of <a href="https://akvelon.com/" target="_blank" rel="nofollow noopener noreferrer">Akvelon,</a> wrote <a href="https://akvelon.com/akvelon-enables-non-python-apps-to-integrate-machine-learning-models-with-mlem/" target="_blank" rel="nofollow noopener noreferrer">this great tutorial</a> on using MLEM to make the process of integrating, packaging, and deploying machine learning models much easier. In the tutorial, they show how to do this with Akvelon’s .NET and Java clients for use in existing or new Web (ASP.NET, Java Spring), Mobile (Xamarin, Android), and Desktop (WPF, WinForms, Java Spring) applications. Explore the project directory <a href="https://github.com/akvelon/MLEM-SDK-for-Java" target="_blank" rel="nofollow noopener noreferrer">here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/61de316bc2b4b785a53f2d18a96b4009/39600/akvelon.png" alt="Akvelon enables non-Python apps to integrate machine learning models with MLEM" title="Akvelon enables non-Python apps to integrate machine learning models with MLEM" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Akvelon enables non-Python apps to integrate machine learning models with MLEM (<a href="https://akvelon.com/akvelon-enables-non-python-apps-to-integrate-machine-learning-models-with-mlem/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h1 id="company-news" style="position:relative;">Company News<a href="#company-news" 
aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/LdBroIIcAdoj8NuG6Q/giphy.gif" alt="Awesome Thats Lit GIF by Samsung Austria"></p> <h2 id="dvc-live-experiment-tracking" style="position:relative;">DVC Live Experiment Tracking<a href="#dvc-live-experiment-tracking" aria-label="dvc live experiment tracking permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We’ve been listening to the greater Community and know you’d like to see easier experiment tracking from DVC, and we’re on it! <a href="https://iterative.ai/blog/exp-tracking-dvc-python?tab=DVC-extension-for-VS-Code" target="_blank" rel="nofollow noopener noreferrer">The latest release of DVCLive</a> helps bring that goal to fruition. Now you can track your experiments with only a couple of lines of code directly from your notebook or your .py file. You can start with just a repo with Git and DVC initialized and keep using your existing tools, with no need for a hosted solution, server, or database. 
Keep track of all the metadata related to the experiment in your Git provider of choice (GitHub/GitLab), and your cloud storage, and share with your team when you are ready. In addition, you can use Iterative Studio to share the results of your experiments with teammates.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 400px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5e118b266ce2ac73c087600246e067a6/03346/ariel-biller.jpg" alt="Ariel Biller Experiment Tracking meme" title="Ariel Biller Experiment Tracking meme" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ariel Biller's Experiment Tracking meme (<a href="https://twitter.com/untitled01ipynb/status/1593911944989270016?s=20&t=h0rvf7Bi7ikf9E3hna4vYw" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="new-unstructured-data-query-language" style="position:relative;">New Unstructured Data Query Language<a href="#new-unstructured-data-query-language" aria-label="new unstructured data query language permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Do you use Amazon S3, Azure Blob Storage, or Google Cloud Storage? We have a new solution for finding and managing your datasets of unstructured data like images, audio files, and PDFs! Extend your DVC environment with the first unstructured data query language (think SQL -> DQL) for machine learning. 
We are looking for beta customers for this new tool.</p> <p><a href="https://calendly.com/gtm-2/iterative-datamgmt-overview" target="_blank" rel="nofollow noopener noreferrer">Schedule a meeting with us</a> if that's what you're needing!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8e61fa8c2db431382c5a89161f23db10/39600/dvc-cloud.png" alt="Unstructured Data Query Language from the makers of DVC" title="Unstructured Data Query Language from the makers of DVC" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Unstructured Data Query Language Prototype</em></p> <h2 id="gto-tutorial-on-the-blog" style="position:relative;">GTO Tutorial on the Blog<a href="#gto-tutorial-on-the-blog" aria-label="gto tutorial on the blog permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A model registry is a tool to catalog ML models and their versions. Models from your data science projects can be discovered, tested, shared, deployed, and audited from there. 
Learn how to build a model registry in a DVC Git repo without involving any extra services, integrations, and APIs in <a href="https://iterative.ai/blog/gto-model-registry" target="_blank" rel="nofollow noopener noreferrer">this new post</a> from <a href="https://www.linkedin.com/in/1aguschin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Guschin</strong></a>!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bcd104768a32e723be669c72e5520ba0/03346/drawing-owl-step-by-step.jpg" alt="Building a GitOps ML Model Registry with DVC and GTO" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>On January 11th, <a href="https://www.linkedin.com/in/francescocalcavecchia/" target="_blank" rel="nofollow noopener noreferrer"><strong>Francesco Calcavecchia</strong></a> will be joining us to share about his recent contribution to MLEM through his work on GTO and how this helps him in his work at <a href="https://www.eon.de/de/pk.html" target="_blank" rel="nofollow noopener noreferrer">E.On Energie Deutschland</a> with creating a Git-based model registry.</p> <p> </p><section class="elp-content-holder"> <a 
href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289772002/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Francesco Calcavecchia on Designing a model Registry with Legacy Systems using DVC and GTO</h4> <div class="elp-description">Join us on January 11th. Designing a Model Registry with Legacy Systems using GTO!</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-12-16/meetup-4b11eb06e8fc8da7fcb2fe756fabd127.png" alt="Francesco Calcavecchia on Designing a model Registry with Legacy Systems using DVC and GTO"> </div> </a> </section> <p></p> <h2 id="flappy-deevee" style="position:relative;">Flappy DeeVee<a href="#flappy-deevee" aria-label="flappy deevee permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our global, all-remote team works hard, but we also have fun! We have a weekly All-Hands meeting where our teams report progress via pre-recorded video so that everyone can be prepared to discuss the topic during the meeting.</p> <p>As we all level up our video production skills, the videos have started to get more fun! <a href="https://www.linkedin.com/in/jesper-svendsen-10892b1bb/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jesper Svendsen</strong></a> inserted this FlappyDeeVee video in the middle of our Iterative Studio update! 
Try the game <a href="https://flappycreator.com/flappy.php?id=638f6f7f1e9c8" target="_blank" rel="nofollow noopener noreferrer">here!</a> Confession: I can’t get past the first pipe! 😆</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-12-16/FlappyDeeVee-8bb7e63a03292475db9db6842b23780c.mp4" type="video/mp4"> Your browser does not support the video tag. </video></p> <p>Stay tuned to <a href="https://iterative.ai/#:~:text=Go%20to%20Twitter-,Subscribe,-for%20updates.%20We" target="_blank" rel="nofollow noopener noreferrer">our Newsletter </a> for more content from the Community and what we will be up to conference-wise in 2023!</p> <h2 id="-doc-updates" style="position:relative;">✍🏼 Doc Updates!<a href="#-doc-updates" aria-label=" doc updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> team recently made updates to their commands to make them more intuitive. If you are used to the old ones, do not fret: if you use an old command, a message will pop up in the CLI to tell you what the new one is. 
In the meantime, you can get up to date on the changes <a href="https://cml.dev/doc/ref" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our <a href="https://iterative.ai/blog/jupyter-notebook-dvc-pipeline" target="_blank" rel="nofollow noopener noreferrer">Notebooks to DVC Pipeline for Reproducible Experiments</a> from <a href="https://www.linkedin.com/in/rcdewit?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAA5CEPkB9fI02IpClBKhRdq2brULPHMhmR8&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BaKm1eO7JQle9sN63j%2FHHFA%3D%3D" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> was noted in <a href="https://twitter.com/dl_weekly" target="_blank" rel="nofollow noopener noreferrer">Deep Learning Weekly.</a></p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🤖 Issue #276 is now live! This week in deep learning: AI with the right dose of curiosity, notebooks to DVC pipelines for reproducible experiments, generating human-level text with contrastive search, an open-source data exploration tool, and more.<a href="https://t.co/JXUkrOEYzC">https://t.co/JXUkrOEYzC</a></p>— Deep Learning Weekly (@dl_weekly) <a href="https://twitter.com/dl_weekly/status/1592900833741393920">November 16, 2022</a></blockquote> <p><em>Have something great to say about our tools? We'd love to hear it! 
Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/exp-tracking-dvc-pythonhttps://dvc.org/blog/exp-tracking-dvc-pythonThu, 15 Dec 2022 00:00:00 GMT<p>Did you know that DVC can track experiments? Now you can track experiments in DVC by changing a few lines of your Python code.</p> <p>And with the optional <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension for VS Code</a>, you have a full-fledged experiment tracking interface in your IDE!</p> <toggle> <tab title="DVC extension for VS Code"> <p><video controlslist="nodownload" preload="metadata" muted controls style="width:100%;"><source src="/2022-12-15/dvclive_exp_tracking-42c4f5a2c17a7b093355745095508589.mp4" type="video/mp4"> Your browser does not support the video tag. </video></p> </tab> <tab title="Notebook"> <p><video controlslist="nodownload" preload="metadata" muted controls style="width:100%;"><source src="/2022-12-15/dvclive_exp_tracking_cli-ec65b9aeb72b88e9e31a4c80af52f5b3.mp4" type="video/mp4"> Your browser does not support the video tag. 
</video></p> </tab> </toggle> <h1 id="why" style="position:relative;">Why?<a href="#why" aria-label="why permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>We want to bring the DVC ethos to experiment tracking, but the learning curve for DVC can be steep. That's why we built our Python logging library <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> to make it easy to start.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 430px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/21e2e2944c1c1b11883c74e0932f31b5/39600/another_exp_tracker.png" alt="another exp tracker" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>source: <a href="https://twitter.com/untitled01ipynb/status/1593911944989270016" target="_blank" rel="nofollow noopener noreferrer">https://twitter.com/untitled01ipynb/status/1593911944989270016</a></em></p> <p>All you need to start is a Git repo. There are no logins, servers, databases, or UI to spin up. Every experiment run is saved in a Git commit, but those commits are hidden so they don't clutter your repo, unlike saving each run to a separate directory, or creating a Git branch for each.</p> <p>From that simple starting point, DVC experiment tracking grows with your project. You don't have to decide today whether you will need to share with your team or backup to cloud storage. 
That's because DVC builds on top of the tools you already use and allows you to incrementally integrate them.</p> <p>When you need to <a href="https://dvc.org/doc/user-guide/experiment-management/sharing-experiments" target="_blank" rel="nofollow noopener noreferrer">share</a>, push existing experiments to your Git provider (GitHub/GitLab). When you need artifact <a href="https://dvc.org/doc/start/data-management/data-versioning#storing-and-sharing" target="_blank" rel="nofollow noopener noreferrer">storage</a>, add your own cloud provider and push your existing artifacts. When you need a UI, use VS Code or add <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> for a collaborative interface.</p> <h1 id="how-to-start" style="position:relative;">How to start<a href="#how-to-start" aria-label="how to start permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Check out the example <a href="https://github.com/iterative/dvclive-exp-tracking" target="_blank" rel="nofollow noopener noreferrer">repo</a>, try it out in a <a href="https://colab.research.google.com/drive/1VKEBdSgFdEjg-k6FqNXX-0o83QWcpmN_?usp=sharing" target="_blank" rel="nofollow noopener noreferrer">colab notebook</a>, or follow the steps below to start with your own model training code.</p> <ol> <li> <p>Install DVC>=2.38.0 as a library in your Python environment.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span 
class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">--upgrade</span> dvc</span></code></pre></div> </li> <li> <p>Set up a DVC repo where your model training code is (or use an existing repo).</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token parameter variable">-A</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">"setup dvc repo"</span></span></code></pre></div> </li> <li> <p>In your code, enable DVC experiment tracking using <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> with <code>save_dvc_exp=True</code>. Use the callback for your framework or log your own metrics. 
You can find examples below (<a href="https://dvc.org/doc/dvclive/api-reference/ml-frameworks" target="_blank" rel="nofollow noopener noreferrer">other frameworks available</a>):</p> </li> </ol> <toggle> <tab title="Pytorch Lightning"> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>lightning <span class="token keyword">import</span> DVCLiveLogger

<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>

trainer <span class="token operator">=</span> Trainer<span class="token punctuation">(</span>logger<span class="token operator">=</span>DVCLiveLogger<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
trainer<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>model<span class="token punctuation">)</span></code></pre></div> </tab> <tab title="Hugging Face"> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>huggingface <span class="token keyword">import</span> DVCLiveCallback

<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>

trainer<span class="token punctuation">.</span>add_callback<span class="token punctuation">(</span>DVCLiveCallback<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
trainer<span class="token punctuation">.</span>train<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> </tab> <tab title="Keras"> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>keras <span class="token keyword">import</span> DVCLiveCallback

<span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>

model<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>
    train_dataset<span class="token punctuation">,</span>
    validation_data<span class="token operator">=</span>validation_dataset<span class="token punctuation">,</span>
    callbacks<span class="token operator">=</span><span class="token punctuation">[</span>DVCLiveCallback<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div> </tab> <tab title="General Python API"> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvclive <span class="token keyword">import</span> Live

<span class="token keyword">with</span> Live<span class="token punctuation">(</span>save_dvc_exp<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span> <span class="token keyword">as</span> live<span class="token punctuation">:</span>
    live<span class="token punctuation">.</span>log_param<span class="token punctuation">(</span><span class="token string">"epochs"</span><span class="token punctuation">,</span> NUM_EPOCHS<span class="token punctuation">)</span>

    <span class="token keyword">for</span> epoch <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>NUM_EPOCHS<span class="token punctuation">)</span><span class="token punctuation">:</span>
        train_model<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span>
        metrics <span class="token operator">=</span> evaluate_model<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span>
        <span class="token keyword">for</span> metric_name<span class="token punctuation">,</span> value <span class="token keyword">in</span> metrics<span class="token punctuation">.</span>items<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            live<span class="token punctuation">.</span>log_metric<span class="token punctuation">(</span>metric_name<span class="token punctuation">,</span> value<span class="token punctuation">)</span>
        live<span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> </tab> </toggle> <p>4. 
Run your code and track the experiment results.</p> <toggle> <tab title="DVC extension for VS Code"> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/37aa508562076134553fffd26e1c8c4b/39600/dvclive_exp_tracking.png" alt="dvclive exp tracking" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> </tab> <tab title="Command line"> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token comment"># Show the experiments table in the terminal.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span>
</span> ────────────────────────────────────────────────────────────────────────────────────
  Experiment                 Created        train_loss   epoch   step   encoder_size
 ────────────────────────────────────────────────────────────────────────────────────
  workspace                  -                0.020196       4    500   512
  main                       Dec 06, 2022            -       -      -   -
  ├── c1759a5 [quare-foil]   08:55 PM         0.020196       4    500   512
  ├── affedee [bitty-tass]   08:55 PM          0.02038       4    500   256
  ├── a5bdc18 [murky-emeu]   08:55 PM         0.016396       4    500   128
  ├── 744f3b6 [sworn-wage]   08:54 PM          0.01972       4    500   64
  └── 0c3ac81 [named-gaby]   08:54 PM         0.031206       4    500   32
 ────────────────────────────────────────────────────────────────────────────────────

<span class="token comment"># Plot the diff of all experiments in an HTML file.</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token variable"><span class="token variable">$(</span>dvc exp list --name-only<span class="token variable">)</span></span>
</span>file:///Users/dave/Code/dvclive-exp-tracking/dvc_plots/index.html</code></pre></div> <p>Open the HTML to see the plots:</p> <p><img 
src="https://dvc.org/2022-12-15/dvclive_exp_tracking_plots_diff-4da17e97756bf8f97e5ad63ca9f8ca3c.svg" alt="" title="=500"></p> </tab> </toggle> <h1 id="stay-tuned" style="position:relative;">Stay tuned<a href="#stay-tuned" aria-label="stay tuned permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>That's all there is to it! There's lots more coming for DVC experiment tracking, including:</p> <ul> <li> <p><strong>Showing you where to go from here</strong>. Share your experiments, add data or pipelines, and use DVC without ever leaving your notebook or Python IDE.</p> </li> <li> <p><strong>Adding more DVCLive features</strong>. Share realtime updates to <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a>, log data and model artifacts, and compare experiments in Python.</p> </li> </ul> <p>Try out the <a href="https://github.com/iterative/dvclive-exp-tracking" target="_blank" rel="nofollow noopener noreferrer">repo</a> or <a href="https://colab.research.google.com/drive/1VKEBdSgFdEjg-k6FqNXX-0o83QWcpmN_?usp=sharing" target="_blank" rel="nofollow noopener noreferrer">colab notebook</a> and let us know what you think in <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> or <a href="https://github.com/iterative/dvc/issues" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p>https://dvc.org/blog/gto-model-registryhttps://dvc.org/blog/gto-model-registryWed, 07 Dec 2022 00:00:00 GMT<p>Machine Learning is iterative in its nature. 
Similar to developing software, you’re going to have many different versions of your models, improving them step by step (such as <code>v0.1.0</code>, <code>v0.2.0</code>, etc). To keep track of model development, trigger checks, and deployments, and know which versions are in production and which are stuck in staging (both right now and retrospectively), ML specialists organize models' lifecycles using Model Registries.</p> <h2 id="the-pluses-and-minuses-of-model-registries" style="position:relative;">The Pluses and Minuses of Model Registries<a href="#the-pluses-and-minuses-of-model-registries" aria-label="the pluses and minuses of model registries permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>While model registries solve operational issues, many solutions come at a cost. Model Registries often introduce a separate software stack that must be learned, integrated with, and maintained. For example, if you keep your model training code in Git, train your models with CI/CD, and use CI/CD to deploy them, introducing a separate service in the middle of the process breaks the flow and forces you to leave your code versioning ecosystem (Git + GitHub for example). This happens when we add more and more systems and services that all try to be the center of attention. 
A good example is working with MLFlow or SageMaker as a model registry - there’s a feeling it’s always “in the way” of the Git-based development workflow.</p> <h2 id="our-git-based-solution-to-model-registry" style="position:relative;">Our Git-based Solution to Model Registry<a href="#our-git-based-solution-to-model-registry" aria-label="our git based solution to model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To help you with that, we developed a CLI tool named <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a>. The tool is very simple - it organizes Model Registry in your Git repo using Git tags and a file called <code>artifacts.yaml</code>. 
Welcome to this short tutorial on how to do just that - and it's simpler than you might think.</p> <p>Before we start, let’s take a look at <a href="https://iterative.ai/model-registry" target="_blank" rel="nofollow noopener noreferrer"><strong>Studio Model Registry</strong></a>, which provides a nice UI dashboard on top of GTO-managed registries:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/63d892c74b838ee3620d17e5dd877e95/39600/iterative-studio-model-registry.png" alt="Iterative Studio Model Registry" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> The model dashboard above has three models from a single Git repo (we’ll add another one in a minute). <a href="https://github.com/iterative/demo-bank-customer-churn/tags" target="_blank" rel="nofollow noopener noreferrer">Git tags</a> in this repo represent the version registrations (such as <code>v2.0.1</code> or <code>v1.0.1</code>) and stage assignments (like <code>dev</code>, <code>prod</code>, and <code>staging</code>) done by team members (assigning <code>v1.0.0</code> to <code>dev</code> signals the version is ready to be deployed to the <code>dev</code> environment and can trigger that deployment directly).</p> <admon type="tip"> <p>Take a look around in our <a href="https://studio.datachain.ai/team/Iterative/models" target="_blank" rel="nofollow noopener noreferrer">demo Model Registry</a> to get a feel for Iterative Studio's Model Registry features.</p> </admon> <p>GTO provides a simpler view of the same registry in the CLI, accessible from a terminal and friendly to developers:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto show</span> <span 
class="token parameter variable">--repo</span> https://github.com/iterative/demo-bank-customer-churn
</span>╒════════════════════╤══════════╤════════╤═════════╤══════════╕
│ name               │ latest   │ <span class="token comment">#dev   │ #prod   │ #stage   │</span>
╞════════════════════╪══════════╪════════╪═════════╪══════════╡
│ randomforest-model │ v2.0.0   │ v2.0.0 │ v1.0.0  │ -        │
│ xgboost-model      │ v1.0.1   │ -      │ -       │ v1.0.0   │
│ lightgbm-model     │ v2.0.3   │ v2.0.3 │ v2.0.0  │ v2.0.0   │
╘════════════════════╧══════════╧════════╧═════════╧══════════╛</code></pre></div> <p>Notice that GTO works with a single repo at a time - that’s why we need to specify the <code>--repo</code> argument, while Studio aggregates your models from multiple projects and repositories you add to it.</p> <p>For this tutorial, we'll pick a simple project with no models registered yet, to demonstrate adding a model registry on top of an existing ML project. We'll take <a href="https://github.com/iterative/example-get-started" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/example-get-started</a>, which is an example DVC project. We won’t get into details about DVC, but if you’re new to it, you can check out <a href="https://dvc.org/doc/start" target="_blank" rel="nofollow noopener noreferrer">DVC Get Started</a>. If you wish, look over the example project before we start to get a quick picture of it.</p> <p>The project trains a natural language processing (NLP) binary classifier predicting tags for a given StackOverflow question. It uses DVC Pipelines to connect raw text preprocessing and model training, producing an ML model stored in <code>model.pkl</code>. 
The <code>main</code> branch has a model version we can consider as the first version, while the branch <code>try-large-dataset</code> is a promising experiment that we’d like to mark as the second version and assign to the <code>dev</code> stage to trigger a deployment.</p> <p>To start, we need to <a href="https://github.com/iterative/example-get-started/fork" target="_blank" rel="nofollow noopener noreferrer">fork the repo</a>, since we’re going to make some changes to it. Note that you need to uncheck "Copy the <code>main</code> branch only" because we'll be using the <code>try-large-dataset</code> branch as well:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f4725bb8104a35916038304c9aac6e22/39600/fork-uncheck.png" alt="fork" title="fork" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>To use GTO from the CLI, we'll set up a Python virtual environment:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> <span class="token parameter variable">-m</span> venv .venv
</span><span class="token line"><span class="token input">$ </span><span class="token command">source</span> .venv/bin/activate
</span><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> gto</span></code></pre></div> <p>To remove some friction, we won’t clone the repo locally. 
This will save us from running <code>commit</code> and <code>push</code> to update the remote repo, and GTO will do that for us.</p> <h2 id="registering-a-model-version" style="position:relative;">Registering a model version<a href="#registering-a-model-version" aria-label="registering a model version permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In the repo, we have an already trained ML model saved as <code>model.pkl</code>. The file itself resides in an AWS S3 bucket and is tracked with DVC. One of the versions of that model can be found in the HEAD of the <code>main</code> branch. Let’s register the very first version of it - <a href="https://semver.org/" target="_blank" rel="nofollow noopener noreferrer"><code>v0.0.1</code></a>. 
Since we’ll be using our remote repo many times here, we'll set a shell variable <code>$REPO</code> to store the URL.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">REPO</span><span class="token operator">=</span>https://github.com/<span class="token punctuation">{</span>user<span class="token punctuation">}</span>/example-get-started
</span><span class="token line"><span class="token input">$ </span><span class="token gto">gto register</span> classifier <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span>
</span>Created git tag 'classifier@v0.0.1' that registers a version
Running `git push origin classifier@v0.0.1`
Successfully pushed git tag classifier@v0.0.1 on remote.</code></pre></div> <p>Now the model is called <code>classifier</code> in our registry and the <code>v0.0.1</code> version is registered at the tip of the <code>main</code> branch.</p> <p>Since the repo we're working with is a remote one, GTO pushes a tag to the repo automatically. With a local repo, you will need to run <code>git push</code> yourself (although you can make GTO do that by providing a <code>--push</code> argument). 
This workflow should be familiar to DVC and Git users - making changes locally and then pushing them to remote with an additional command.</p> <p>Now we can see the model dashboard of our registry:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto show</span> <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span>
</span>╒════════════╤══════════╕
│ name       │ latest   │
╞════════════╪══════════╡
│ classifier │ v0.0.1   │
╘════════════╧══════════╛</code></pre></div> <p>Remember that we only see a single <code>classifier</code> model because GTO works with a single repo and the models we’ve seen above were from another repository (notice the <code>--repo</code> argument).</p> <p>A common case is to use a model registry as a source of truth to pull models for experimentation locally or in CI for deployments. Note that for now we manually provide the path to the model (<code>model.pkl</code>) and Git revision to use (<code>classifier@v0.0.1</code>). 
We’ll learn how to dynamically set them up using GTO in the next sections.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token variable">$REPO</span> model.pkl <span class="token parameter variable">--rev</span> classifier@v0.0.1 <span class="token parameter variable">-o</span> model.pkl</span></code></pre></div> <h2 id="adding-optional-model-metadata" style="position:relative;">Adding optional model metadata<a href="#adding-optional-model-metadata" aria-label="adding optional model metadata permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To avoid hardcoding a model path in our scripts or writing a model description somewhere in a notebook, we need to store metadata about the model in the repo itself. Unlike the Git tag we created to register a version, GTO stores metadata in a file, which requires us to create a commit. This allows us to have different paths or descriptions in different commits and branches, which can be useful if you’ll be updating your model significantly or changing the structure of your repo. Since the model is not annotated right now, let’s add that information to the new model version in the <code>try-large-dataset</code> branch that <a href="https://studio.datachain.ai/team/Iterative/projects/example-get-started-zde16i6c4g" target="_blank" rel="nofollow noopener noreferrer">increased the ROC AUC of the model</a>. 
Later we can merge this to <code>main</code> to update the annotation there:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto annotate</span> classifier <span class="token punctuation">\</span>
    <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">--rev</span> try-large-dataset <span class="token punctuation">\</span>
    <span class="token parameter variable">--path</span> model.pkl <span class="token punctuation">\</span>
    <span class="token parameter variable">--description</span> <span class="token string">"Simple text classification model"</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">--type</span> model
</span>Updated `artifacts.yaml`
Running `git commit` and `git push`
Successfully pushed a new commit to remote.</code></pre></div> <p>This creates an <code>artifacts.yaml</code> file with the following contents in the <code>try-large-dataset</code> branch:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">classifier</span><span class="token punctuation">:</span>
  <span class="token key atrule">path</span><span class="token punctuation">:</span> model.pkl
  <span class="token key atrule">description</span><span class="token punctuation">:</span> Simple text classification model
  <span class="token key atrule">type</span><span class="token punctuation">:</span> model</code></pre></div> <h2 id="registering-another-version" style="position:relative;">Registering another version<a href="#registering-another-version" aria-label="registering another version permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>GTO lets you build any kind of registry: a dataset registry, a model registry, or a mix of both. To distinguish between different artifact types (e.g. a <code>dataset</code> and a <code>model</code>), it’s good to specify <code>type</code> while annotating. This also hints to Studio that <code>classifier</code> is a <code>model</code>, so Studio can display it in the Studio Model Registry.</p> <p>Let’s register a new version in the <code>try-large-dataset</code> branch:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto register</span> classifier <span class="token punctuation">\</span>
    <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">--rev</span> try-large-dataset</span>
Created git tag 'classifier@v0.0.2' that registers version
Running `git push origin classifier@v0.0.2`
Successfully pushed git tag classifier@v0.0.2 on remote.</code></pre></div> <p>Checking the updated model dashboard:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto show</span> <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span></span>
╒════════════╤══════════╕
│ name       │ latest   │
╞════════════╪══════════╡
│ classifier │ v0.0.2   │
╘════════════╧══════════╛</code></pre></div> <p>The latest version of <code>classifier</code> is now <code>v0.0.2</code>.</p> <p>To download the model 
and use it locally, we can now let GTO resolve the path from the value stored in <code>artifacts.yaml</code> and download it using DVC, so our script can use the value stored in the repo instead of a hardcoded path:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">REVISION</span><span class="token operator">=</span>classifier@v0.0.2</span>
<span class="token line"><span class="token input">$ </span><span class="token command">MODEL_PATH</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span>gto describe classifier <span class="token parameter variable">--repo</span> $REPO <span class="token parameter variable">--rev</span> $REVISION <span class="token parameter variable">--path</span><span class="token variable">)</span></span></span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token variable">$REPO</span> <span class="token variable">$MODEL_PATH</span> <span class="token parameter variable">--rev</span> <span class="token variable">$REVISION</span> <span class="token parameter variable">-o</span> <span class="token variable">$MODEL_PATH</span></span></code></pre></div> <h2 id="assigning-stages-to-deploy-a-model" style="position:relative;">Assigning stages to deploy a model<a href="#assigning-stages-to-deploy-a-model" aria-label="assigning stages to deploy a model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now, we have two registered versions of our model: 
<code>v0.0.1</code> and <code>v0.0.2</code>. How do we get one of them into production? To signal that a model version is ready to be used in some environment, we can assign it to a stage:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto assign</span> classifier <span class="token punctuation">\</span>
    <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">--version</span> v0.0.2 <span class="token punctuation">\</span>
    <span class="token parameter variable">--stage</span> dev</span>
Created Git tag 'classifier#dev#1' that assigns stage
Running `git push origin classifier#dev#1`
Successfully pushed git tag classifier#dev#1 on remote.</code></pre></div> <p>To actually start the deployment process, we'll need to set up a CI/CD workflow that can be triggered by pushing a Git tag. We'll discuss this in the next section.</p> <p>Now the model dashboard will be updated with the newly assigned <code>dev</code> stage:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token gto">gto show</span> <span class="token parameter variable">--repo</span> <span class="token variable">$REPO</span></span>
╒════════════╤══════════╤════════╕
│ name       │ latest   │ <span class="token comment">#dev   │</span>
╞════════════╪══════════╪════════╡
│ classifier │ v0.0.2   │ v0.0.2 │
╘════════════╧══════════╧════════╛</code></pre></div> <p>When running <a href="https://dvc.org/doc/gto/command-reference/show"><code>gto show</code></a> for a specific model, we will get all of its registered versions. 
Notice that the stage is shown next to the latest version that was assigned to it, signaling the model version currently deployed in that stage:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">$ gto show classifier --repo $REPO
╒════════════╤═══════════╤═══════════╤═══════════════════╕
│ artifact   │ version   │ stage     │ ref               │
╞════════════╪═══════════╪═══════════╪═══════════════════╡
│ classifier │ v0.0.2    │ dev       │ classifier@v0.0.2 │
│ classifier │ v0.0.1    │           │ classifier@v0.0.1 │
╘════════════╧═══════════╧═══════════╧═══════════════════╛</code></pre></div> <p>When you have dozens of models, it’s easier to automate finding out which versions are currently assigned to stages. For that, we can use a variation of the <code>show</code> command. To download the <code>classifier</code> version in <code>dev</code>:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">REVISION</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span>gto show classifier<span class="token comment">#dev --repo $REPO --ref</span><span class="token variable">)</span></span></span>
<span class="token line"><span class="token input">$ </span><span class="token command">MODEL_PATH</span><span class="token operator">=</span><span class="token variable"><span class="token variable">$(</span>gto describe classifier <span class="token parameter variable">--repo</span> $REPO <span class="token parameter variable">--rev</span> $REVISION <span class="token parameter variable">--path</span><span class="token variable">)</span></span></span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token variable">$REPO</span> <span class="token variable">$MODEL_PATH</span> <span class="token parameter variable">--rev</span> <span class="token variable">$REVISION</span> <span class="token parameter variable">-o</span> <span class="token variable">$MODEL_PATH</span></span></code></pre></div> <h2 id="starting-cicd-for-new-versions-and-assignments" style="position:relative;">Starting CI/CD for new versions and assignments<a href="#starting-cicd-for-new-versions-and-assignments" aria-label="starting cicd for new versions and assignments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>CI/CD is a common way to set up automation, including building your models into Docker images or deploying them to Kubernetes or SageMaker. Since new versions and stage assignments are implemented as Git tags, pushing them can automatically kick off a CI/CD process that you set up with <a href="https://docs.github.com/en/actions" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions</a> or any other CI/CD tool, so you can react programmatically with whatever actions you would like to perform.</p> <p>Showing a full CI/CD example is worthy of a dedicated blog post, so we’ll save it for another time. If you want to see how it works, there are two examples in the <a href="https://github.com/iterative/example-gto/actions" target="_blank" rel="nofollow noopener noreferrer">GTO example repo</a>. 
The one in the <code>main</code> branch <a href="https://github.com/iterative/example-gto/blob/main/.github/workflows/gto-act-on-tags.yml" target="_blank" rel="nofollow noopener noreferrer">shows how to parse a Git tag</a> to react on new versions and stage assignments differently, while the other in the <code>mlem</code> branch explains <a href="https://github.com/iterative/example-gto/blob/mlem/.github/workflows/deploy-model-with-mlem.yml" target="_blank" rel="nofollow noopener noreferrer">how to deploy your model in a single line</a> with <a href="https://github.com/iterative/example-gto/blob/mlem/.github/workflows/deploy-model-with-mlem.yml" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>.</p> <h2 id="taking-a-high-level-look-at-our-model-registry" style="position:relative;">Taking a high-level look at our Model Registry<a href="#taking-a-high-level-look-at-our-model-registry" aria-label="taking a high level look at our model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We just learned how to register semantic model versions, assign stages to them, and employ CI/CD to act on those, all using a GitOps approach. Used together with DVC, this allows us to accomplish the main use cases for a powerful model registry, while not introducing any extra services and staying inside a Git Repo.</p> <p>As we saw above, GTO works within a single repo and requires you to work in CLI. 
To lift these limitations, we introduced Iterative Studio Model Registry which, in a nutshell, is a friendly UI that allows you to work with GTO artifacts gathered from multiple repositories. This is what <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">Studio Model Registry</a> will look like if you log in and add the repo:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/84f17464955e2d4b4d341bf40212d5e3/39600/iterative-studio-model-registry-2.png" alt="Iterative Studio Model Registry" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Besides the <code>classifier</code> model that we just registered, you can also see three other models from our example <code>demo-bank-customer-churn</code> repo.</p> <p>Behind the scenes, <a href="https://dvc.org/doc/studio" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio just uses GTO API</a>, so there are no new magic tricks here (and you can also use GTO API from your automation Python code if you wish). 
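</p> <p>Automation code does not always need the GTO API: a CI job that reacts to the Git tags created earlier, for instance, first has to tell a version registration from a stage assignment. Here is a minimal Python sketch of that step. It assumes tag names follow the two shapes used in this post, <code>name@version</code> (such as <code>classifier@v0.0.2</code>) and <code>name#stage#N</code> (such as <code>classifier#dev#1</code>); the helper <code>parse_gto_tag</code> and the keys of the dictionary it returns are hypothetical names chosen for illustration, not part of GTO:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def parse_gto_tag(tag):
    """Classify a GTO-style Git tag name (hypothetical helper, not GTO's own parser).

    Assumed tag shapes:
      name@version  e.g. classifier@v0.0.2  registers a version
      name#stage#N  e.g. classifier#dev#1   assigns a stage
    """
    # A registration tag has exactly one meaningful split at "@".
    name, sep, version = tag.partition("@")
    if sep and name and version:
        return {"event": "registration", "name": name, "version": version}

    # An assignment tag has three non-empty parts and a numeric counter.
    parts = tag.split("#")
    if len(parts) == 3 and all(parts) and parts[2].isdigit():
        name, stage, counter = parts
        return {
            "event": "assignment",
            "name": name,
            "stage": stage,
            "counter": int(counter),
        }

    raise ValueError("not a GTO-style tag: " + tag)


if __name__ == "__main__":
    # Tag names as they appear in the command output earlier in this post.
    print(parse_gto_tag("classifier@v0.0.2"))
    print(parse_gto_tag("classifier#dev#1"))</code></pre></div> <p>In a real workflow the tag name would come from the CI system rather than a literal string, and the GTO API or CLI remains the safer choice whenever you need GTO's own validation.</p> <p>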
Feel free to play around to register more versions, assign stages or annotate the other models you have, and see how Studio can help you track model lineage, audit events, and connect model versions to DVC experiments.</p> <h2 id="whats-next" style="position:relative;">What’s next?<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Check out <a href="https://mlem.ai/doc/gto/" target="_blank" rel="nofollow noopener noreferrer">GTO docs</a> to learn more about the tool and ask us questions in <a href="https://discord.com/channels/485586884165107732/903647230655881226" target="_blank" rel="nofollow noopener noreferrer">Discord</a> - we’re happy to help you!</p> <p>Take a look at our <a href="https://studio.datachain.ai/team/Iterative/models" target="_blank" rel="nofollow noopener noreferrer">public Model Registry</a> so you can see for yourself how Iterative Studio puts together a Git based Model Registry experience.</p> <p>Share your feedback in <a href="https://discord.com/channels/485586884165107732/903647230655881226" target="_blank" rel="nofollow noopener noreferrer">Discord</a> or <a href="https://github.com/iterative/gto/issues" target="_blank" rel="nofollow noopener noreferrer">GitHub issues</a> to help us build an open-source Model Registry on top of Git, so you can stick to an existing software engineering stack. 
No more divide between ML engineering and operations!</p>https://dvc.org/blog/november-2022-heartbeathttps://dvc.org/blog/november-2022-heartbeatFri, 18 Nov 2022 00:00:00 GMT<p>Welcome to November! In the US, this is the time of year we reflect and give thanks. It's been a productive year despite the world's rather extreme challenges. There's lots to be thankful for. Here are some of those things from the last month in the Iterative Community.</p> <h1 id="ai-news" style="position:relative;">AI News<a href="#ai-news" aria-label="ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h3 id="robert-toews---the-biggest-opportunity-in-generative-ai-is-language-not-images" style="position:relative;">Robert Toews - The Biggest Opportunity in Generative AI Is Language, Not Images<a href="#robert-toews---the-biggest-opportunity-in-generative-ai-is-language-not-images" aria-label="robert toews the biggest opportunity in generative ai is language not images permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 200px; "><img 
class="gatsby-resp-image-image" src="https://dvc.org/static/34a929a6ca9a22a5520ff7aa9b90ef39/03346/forbes.jpg" alt="NLP" title="Rob Toews bets on language over images" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="https://www.forbes.com/sites/robtoews/2022/11/06/the-biggest-opportunity-in-generative-ai-is-language-not-images/?sh=303a5719789d" target="_blank" rel="nofollow noopener noreferrer">In this article</a> entitled <em>The Biggest Opportunity In Generative AI Is Language, Not Images</em>, <a href="https://www.linkedin.com/in/robtoews/" target="_blank" rel="nofollow noopener noreferrer"><strong>Robert Toews</strong></a> argues that AI-powered text generation will create many orders of magnitude more value than text-to-image generation.</p> <blockquote> <p>Language is humanity’s single most important invention. More than anything else, it is what sets us apart from every other species on the planet. Language enables us to reason abstractly, to develop complex ideas about what the world is and could be, to communicate these ideas to one another, and to build on them across generations and geographies. Almost nothing about modern civilization would be possible without language.</p> </blockquote> <p>He points to many examples from industry and academia that have seen, and will continue to see, massive improvements thanks to the power of large language models (LLMs) in the coming years. 
Read the article for all the applications.</p> <h3 id="state-of-ai-report" style="position:relative;">State of AI Report<a href="#state-of-ai-report" aria-label="state of ai report permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The <a href="https://docs.google.com/presentation/d/1WrkeJ9-CjuotTXoa4ZZlB3UPBXpxe4B3FMs9R9tn34I/edit#slide=id.g164b1bac824_0_2794" target="_blank" rel="nofollow noopener noreferrer">State of AI Report</a> is published each year and reports on the most interesting things the authors, <a href="https://twitter.com/nathanbenaich" target="_blank" rel="nofollow noopener noreferrer"><strong>Nathan Benaich</strong></a>, <a href="https://twitter.com/soundboy" target="_blank" rel="nofollow noopener noreferrer"><strong>Ian Hogarth</strong></a>, <a href="https://twitter.com/osebbouh" target="_blank" rel="nofollow noopener noreferrer"><strong>Othmane Sebbouh</strong></a>, and <a href="https://twitter.com/nitarshan" target="_blank" rel="nofollow noopener noreferrer"><strong>Nitarshan Rajkumar</strong></a>, come across in the world of AI throughout the year.</p> <ul> <li>Slide 22: Mirroring the ideas of the Toews article above, this slide discusses the LLM use case of conversational code generation. OpenAI's Codex, which powers <a href="https://github.com/features/copilot" target="_blank" rel="nofollow noopener noreferrer">GitHub's Copilot</a> to produce this capability, was on display at the recent <a href="https://watch.githubuniverse.com/home" target="_blank" rel="nofollow noopener noreferrer">GitHub Universe</a>. 
Other companies, including Salesforce, Google, and DeepMind, are working on code-generation projects of their own, with Google's LLM PaLM coming out as a favored option that used 50x less code in its training data than Codex. Alternatively, DeepMind's AlphaCode generates whole programs rather than individual lines of code.</li> <li>Slide 24: Continuing to echo Toews' article, LLMs in research are greatly improving their mathematical abilities, jumping to far better scores than previous model versions. The slide discusses the techniques that helped achieve these gains.</li> <li>Slides 30 and 31: Challenging Toews' stance, these slides show the great progress in Computer Vision. Diffusion models are doing more than just text-to-image generation. Now they are being used for text-to-video, text generation, audio, molecular design, and more. Info on the techniques now being used can be found in Slide 30. Slide 31 discusses the huge improvement in the next generation of competing text-to-image models, including DALL-E, Imagen, and Parti.</li> </ul> <p>Be sure to digest the whole report for even more AI advances!</p> <p>💓 So for our “Pulse check” this month:</p> <admon type="tip"> <p>Do you agree that NLP will have more impact than computer vision? Tell us about what you are working on with NLP. 
We’d love to get you connected with others struggling with similar issues and know how we can improve our tools to help you with your NLP projects.</p> </admon> <p>Join us in the <code>#general</code> channel in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> to weigh in.</p> <h1 id="community-content-highlights" style="position:relative;">Community Content Highlights<a href="#community-content-highlights" aria-label="community content highlights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="thank-you-hacktoberfest-contributors" style="position:relative;">Thank you Hacktoberfest Contributors!<a href="#thank-you-hacktoberfest-contributors" aria-label="thank you hacktoberfest contributors permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We would like to thank <a href="https://github.com/francesco086" target="_blank" rel="nofollow noopener noreferrer"><strong>Francesco Calcavecchia</strong></a>, <a href="https://github.com/vvssttkk" target="_blank" rel="nofollow noopener noreferrer"><strong>vvssttkk</strong></a>, and <a href="https://github.com/deepyaman" target="_blank" 
rel="nofollow noopener noreferrer"><strong>deepyaman</strong></a> for their contributions to <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a>, <a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>, and <a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">CML</a>, respectively. They will be receiving their own personalized shirts that note their contributions! And many thanks to <a href="https://www.linkedin.com/in/mertbozkir/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mert Bozkir</strong></a> for leading the Hacktoberfest charge here at Iterative!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e80ba8968ec0e28cc7bcd1e8eb624382/39600/hacktoberfest.png" alt="Hacktoberfest Contributors" title="Hacktoberfest Contributors" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>2022 Hacktoberfest Contributions</em></p> <h2 id="joão-santiago-and-team-presenting-on-their-use-of-dvc-at-the-nlp-in-closure-session-2-event" style="position:relative;">João Santiago and team presenting on their use of DVC at the NLP in Clojure Session 2 event<a href="#jo%C3%A3o-santiago-and-team-presenting-on-their-use-of-dvc-at-the-nlp-in-closure-session-2-event" aria-label="joão santiago and team presenting on their use of dvc at the nlp in clojure session 2 event permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 
11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>One of our Community Champions, <a href="https://www.linkedin.com/in/jcpsantiago/" target="_blank" rel="nofollow noopener noreferrer"><strong>João Santiago</strong></a> of <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io</a>, gives an introduction to DVC in preparation for the remainder of the session, where <a href="https://scicloj.github.io/blog/predict-real-vs.-fake-disaster-tweets/" target="_blank" rel="nofollow noopener noreferrer"><strong>Carsten Behring</strong></a>, author of <a href="https://cljdoc.org/d/scicloj/metamorph/0.2.1/doc/readme" target="_blank" rel="nofollow noopener noreferrer">Metamorph</a> and the <a href="https://github.com/scicloj/scicloj.ml" target="_blank" rel="nofollow noopener noreferrer">scicloj.ml</a> platform, presents how NLP pipelines can be managed with DVC, Clojure &amp; Python.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/eubg-fjRh9E?rel=0&%3B=&%3Bshowinfo=0%3B&start=914" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="cml-at-neurips" style="position:relative;">CML at NeurIPS<a href="#cml-at-neurips" aria-label="cml at neurips permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 
9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Last month we reported on CML turning up in research <a href="https://iterative.ai/blog/october-heartbeat#cml" target="_blank" rel="nofollow noopener noreferrer">here</a>. Well, this work will be presented within the virtual Workshop <a href="https://neurips.cc/media/PosterPDFs/NeurIPS%202022/62157.png" target="_blank" rel="nofollow noopener noreferrer">Challenges In Deploying and Monitoring Machine Learning Systems</a> at NeurIPS virtual this year on December 9th. <a href="https://neurips.cc/" target="_blank" rel="nofollow noopener noreferrer">Find out more and register here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7bd565f0a5e5e75c1083e91b224cc9b8/39600/cml-neurips.png" alt="CML at NeurIPS" title="CML at NeurIPS" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Research on CML to be presented at NeurIPS (<a href="https://neurips.cc/media/PosterPDFs/NeurIPS%202022/62157.png" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="new-unstructured-data-catalog" 
style="position:relative;">New Unstructured Data Catalog<a href="#new-unstructured-data-catalog" aria-label="new unstructured data catalog permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Do you use Amazon S3, Azure Blob Storage, or Google Cloud Storage? We have a new solution for finding and managing your datasets of unstructured data like images, audio files, and PDFs! Extend your DVC environment with the first data catalog and query language (SQL->DQL) for unstructured data and machine learning. Learn more on <a href="https://iterative.ai/data-catalog-for-ml" target="_blank" rel="nofollow noopener noreferrer">our website</a> and/or <a href="https://calendly.com/gtm-2/iterative-datamgmt-overview" target="_blank" rel="nofollow noopener noreferrer">schedule a meeting with us</a>!</p> <h2 id="mlem" style="position:relative;">MLEM<a href="#mlem" aria-label="mlem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 250px; "><img class="gatsby-resp-image-image" 
src="https://dvc.org/static/c737dd4c5b3890a090185b9f3ed858b6/39600/dog-on-a-broomstick.png" alt="MLEM Sagemaker and Kubernetes deployment" title="MLEM adds Kubernetes and Sagemaker Deployment" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> In case you missed it, MLEM announced a release on Halloween! MLEM now supports <a href="https://mlem.ai/doc/user-guide/deploying/sagemaker" target="_blank" rel="nofollow noopener noreferrer">SageMaker</a> and <a href="https://mlem.ai/doc/user-guide/deploying/kubernetes" target="_blank" rel="nofollow noopener noreferrer">Kubernetes</a> in addition to <a href="https://mlem.ai/doc/user-guide/deploying/heroku" target="_blank" rel="nofollow noopener noreferrer">Heroku</a> and <a href="https://mlem.ai/doc/user-guide/deploying/docker" target="_blank" rel="nofollow noopener noreferrer">Docker</a>. You can learn how easy it now is to package your models for deployment with only a few lines of code, and never get lost in the Kubernetes docs again! 
Find the <a href="https://iterative.ai/blog/mlem-k8s-sagemaker" target="_blank" rel="nofollow noopener noreferrer">blog post here</a> and be sure to <a href="https://mlem.ai/doc/user-guide/deploying" target="_blank" rel="nofollow noopener noreferrer">visit the docs</a>!</p> <h2 id="soc-2-type-1-compliance" style="position:relative;">SOC 2 Type 1 Compliance<a href="#soc-2-type-1-compliance" aria-label="soc 2 type 1 compliance permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 250px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/844739961f1b26d85af1f3657ed1f21e/39600/soc-2-cover.png" alt="Iterative Achieves SOC 2 Type 1 Compliance" title="Iterative Achieves SOC 2 Type 1 Compliance" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> We are very excited to announce that Iterative is now SOC 2 Type 1 compliant. This certification signals to our customers our commitment to Security, Availability, Processing Integrity, Confidentiality, and Privacy within our organization. We have successfully endured the rigorous process and have learned much as a team in the process. <a href="https://www.linkedin.com/in/gurobokum/" target="_blank" rel="nofollow noopener noreferrer"><strong>Guro Bokum</strong></a> reviews the five key learnings <a href="https://iterative.ai/blog/SOC-2" target="_blank" rel="nofollow noopener noreferrer">in this blog piece</a>. 
You can find the full report on our <a href="https://iterative.ai/security-and-privacy" target="_blank" rel="nofollow noopener noreferrer">Security and Privacy</a> page.</p> <h2 id="dmitry-petrov-at-github-universe" style="position:relative;">Dmitry Petrov at GitHub Universe<a href="#dmitry-petrov-at-github-universe" aria-label="dmitry petrov at github universe permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>On November 8th, our CEO, <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a>, spoke at <a href="https://githubuniverse.com/" target="_blank" rel="nofollow noopener noreferrer">GitHub Universe</a> on <em>ML with Git: experiment tracking in Codespaces.</em> In his presentation, he shows how to use the DVC extension for VS Code and Codespaces to streamline your machine learning experimentation process. You can find his video below in the event platform if you are registered. We expect the video to be available on YouTube in the next couple of months. 
We'll keep you updated!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e0832e89812503e54d5cd6f6b33c73ab/03346/gh-universe.jpg" alt="Dmitry Petrov at GitHub Universe" title="Dmitry Petrov at GitHub Universe" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Dmitry Petrov during his talk, 𝗠𝗟 𝘄𝗶𝘁𝗵 𝗚𝗶𝘁: 𝗲𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁 𝘁𝗿𝗮𝗰𝗸𝗶𝗻𝗴 𝗶𝗻 𝗖𝗼𝗱𝗲𝘀𝗽𝗮𝗰𝗲𝘀</em></p> <h2 id="rob-de-wit---from-jupyter-notebook-to-dvc-pipeline-for-reproducible-ml-experiments" style="position:relative;">Rob de Wit - From Jupyter Notebook to DVC pipeline for reproducible ML experiments<a href="#rob-de-wit---from-jupyter-notebook-to-dvc-pipeline-for-reproducible-ml-experiments" aria-label="rob de wit from jupyter notebook to dvc pipeline for reproducible ml experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Jupyter Notebooks are great for prototyping, but eventually, you will want to move toward reproducible experiments. Converting a notebook to a DVC pipeline requires a bit of a mental shift. 
<a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> shows you how to accomplish it with an intermediate step: use <a href="https://papermill.readthedocs.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer">Papermill</a> to build a one-stage DVC pipeline that executes our entire notebook, and use the resulting pipeline to run and version ML experiments. Look out for a future post with a more advanced pipeline!</p> <p><img src="https://media.giphy.com/media/wnWvARibI7pykx0mTf/giphy.gif" alt="Dvc GIF"></p> <h2 id="meetups" style="position:relative;">Meetups<a href="#meetups" aria-label="meetups permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>At our next meetup on December 14th, <a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sami Jawhar</strong></a> will present <em>An Open Discussion of Parallel data pipelines with DVC and TPI</em>, an advanced use case for distributing experiments in the cloud. Sami is a great discussion driver. 
If you are interested in higher-level use cases you will want to join the discussion!</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289771497/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Sami Jawhar on Running Parallel Pipelines with DVC and TPI</h4> <div class="elp-description">Join us on December 14th for an open discussion on Running Parallel Pipelines with DVC and TPI!</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-11-18/meetup-4329bdadd3d5940e6d7cd7bf26842a27.png" alt="Sami Jawhar on Running Parallel Pipelines with DVC and TPI"> </div> </a> </section> <p></p> <p>On January 11th, <a href="https://www.linkedin.com/in/francescocalcavecchia/" target="_blank" rel="nofollow noopener noreferrer"><strong>Francesco Calcavecchia</strong></a> will be joining us to share about his recent contribution to MLEM through his work on GTO and how this helps him in his work at <a href="https://www.eon.de/de/pk.html" target="_blank" rel="nofollow noopener noreferrer">E.On Energie Deutschland</a> with creating a Git-based model registry.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289772002/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Francesco Calcavecchia on Designing a model Registry with Legacy Systems using DVC and GTO</h4> <div class="elp-description">Join us on January 11th. 
Designing a Model Registry with Legacy Systems using GTO!</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-11-18/meetup-4329bdadd3d5940e6d7cd7bf26842a27.png" alt="Francesco Calcavecchia on Designing a model Registry with Legacy Systems using DVC and GTO"> </div> </a> </section> <p></p> <h2 id="events" style="position:relative;">Events<a href="#events" aria-label="events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="odsc-west" style="position:relative;">ODSC West<a href="#odsc-west" aria-label="odsc west permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We had a great time at <a href="https://odsc.com/california/" target="_blank" rel="nofollow noopener noreferrer">ODSC West</a>! We had great conversations with conferencegoers and attended great sessions! Dmitry had a packed room for his in-person talk <em>Why You Need a GitOps-based Machine Learning Model Registry</em> and <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> presented <em>CI/CD for Machine Learning</em> virtually. 
At each of the conferences we've sponsored this year, we've had a game called Deevee's Ramen Run. (If you don't know the Ramen connection, you need to spend more time reading the monthly Heartbeats 😉). Below find the top three winners of the game.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6675d96b036ae7ca3df5ba3acd230859/39600/winners.png" alt="Winners of DeeVees Ramen Run" title="Winners of DeeVees Ramen Run" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Winners 1st - 3rd shown above: Alexandra Hagmeyer (pictured with myself and teammate Daniel Barnes), Ryan Renslow, and (name asked to be withheld, but she was good with the picture and DeeVee!)</em></p> <h3 id="mlops-summit-london" style="position:relative;">MLOps Summit London<a href="#mlops-summit-london" aria-label="mlops summit london permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We were also part of the <a href="https://london-ml-ops.re-work.co/" target="_blank" rel="nofollow noopener noreferrer">MLOps Summit in London</a> only a week later! Admittedly, there were different team members in attendance and staffing the booth. Aside from attending a variety of great talks, we met many wonderful people from all over the world. 
This resulted in some really interesting discussions about how different companies approach MLOps.</p> <p>Casper da Costa-Luis gave a well-received talk at the summit on how to painlessly run ML experiments in the cloud with CML. The recording will be made available in the near future, so look out for that! The talk answered at least one of the questions of Deevee's Ramen Run, which yielded <a href="https://www.linkedin.com/posts/rebecca-gorringe_machinelearning-iterative-reworkai-activity-6998338419772772353-FUip?utm_source=share&utm_medium=member_desktop" target="_blank" rel="nofollow noopener noreferrer">some surprised (but excited!) winners</a> this time around.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/94acec2e1c21ff406e9be6340cda88ed/39600/team.png" alt="Iterative Team at MLOps Summit - London" title="Iterative Team at MLOps Summit - London" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative Team members, clockwise from top right: Rob de Wit, Gema Parreño Piqueras, Casper da Costa-Luis, and Chaz Black</em></p> <h3 id="techweek" style="position:relative;">TechWeek<a href="#techweek" aria-label="techweek permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> presented at <a 
href="https://www.ambito.com/negocios/tecnologia/comenzo-la-tech-week-latam-y-espana-mas-600-ofertas-empleo-it-n5578240" target="_blank" rel="nofollow noopener noreferrer">TechWeek</a> in Spain with her talk <em>Reproducibility and Version Control are Important: Follow up with the DVC extension for VS Code</em>. She will be presenting the same talk at <a href="https://events.codemotion.com/conferences/online/2022/online-tech-conference-2022-spanish-edition-autumn" target="_blank" rel="nofollow noopener noreferrer">Codemotion</a>. You can find her talk in Spanish at 2:02 below!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/zXl9qINlbcI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="upcoming-events" style="position:relative;">Upcoming events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>We will be participating in the <a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a> on 
November 29-30 in Toronto</li> <li><a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> will present <em>CI/CD for Machine Learning</em> in an ODSC webinar. <a href="https://app.aiplus.training/courses/CI-CD-for-Machine-Learning" target="_blank" rel="nofollow noopener noreferrer">Register here.</a></li> <li>We will be at <a href="https://pydata.org/eindhoven2022/" target="_blank" rel="nofollow noopener noreferrer">PyData Eindhoven</a> on December 2nd. Come say hi at the booth if you are attending! We have some tickets to give away for the event in <a href="https://discord.com/channels/485586884165107732/497187456051970048/1036999675951190056" target="_blank" rel="nofollow noopener noreferrer">Discord</a>. First come, first served!</li> <li>We are sponsoring <a href="https://normconf.com/" target="_blank" rel="nofollow noopener noreferrer">NormConf</a> on December 15th. They will have Slack-based booths there. We are looking forward to supporting this new conference!</li> </ul> <p>Stay tuned to <a href="https://iterative.ai/#:~:text=Go%20to%20Twitter-,Subscribe,-for%20updates.%20We" target="_blank" rel="nofollow noopener noreferrer">our Newsletter </a> for what we will be up to conference-wise in 2023!</p> <h2 id="-doc-updates" style="position:relative;">✍🏼 Doc Updates!<a href="#-doc-updates" aria-label=" doc updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><img src="https://media.giphy.com/media/BemKqR9RDK4V2/giphy.gif" alt="Computer Working GIF"></p> <p>The team has been busy improving the 
docs for you. See all the latest and greatest updates below.</p> <h3 id="dvc-docs" style="position:relative;">DVC Docs<a href="#dvc-docs" aria-label="dvc docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li><a href="https://dvc.org/doc/api-reference/dvcfilesystem" target="_blank" rel="nofollow noopener noreferrer">DVCFileSystem</a> - DVCFileSystem provides a pythonic file interface ( <a href="https://filesystem-spec.readthedocs.io/" target="_blank" rel="nofollow noopener noreferrer">fsspec-compatible</a> ) for a DVC repo. It is read-only. DVCFileSystem provides a unified view of all the files/directories in your repository, be it Git-tracked or DVC-tracked, or untracked (in the case of a local repository). 
It can reuse the files in the DVC cache and can otherwise stream from <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">supported remote storage</a>.</li> <li>We’ve now added <a href="https://dvc.org/doc/command-reference/plots/show#example-horizontal-bar-plot" target="_blank" rel="nofollow noopener noreferrer">Horizontal bar plots</a> to the mix of <a href="https://dvc.org/doc/command-reference/plots/show"><code>dvc plots show</code></a> !</li> <li>You can now list contents from supported URLs with <a href="https://dvc.org/doc/command-reference/list-url"><code>dvc ls-url</code></a> Find the description, options, and example code <a href="https://dvc.org/doc/command-reference/list-url" target="_blank" rel="nofollow noopener noreferrer">here.</a></li> <li>Based on some feedback we reorganized the <a href="https://dvc.org/doc/user-guide/overview" target="_blank" rel="nofollow noopener noreferrer">User Guide</a> to help you better navigate. Let us know what you think!</li> <li>Similarly, we reorganized the <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive documentation</a> for better navigation.</li> </ul> <h3 id="cml-docs" style="position:relative;">CML docs<a href="#cml-docs" aria-label="cml docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>In CML you can now publicly self-host images with <code>cml comment</code>. 
Find the options <a href="https://cml.dev/doc/ref/comment#--publish" target="_blank" rel="nofollow noopener noreferrer">here.</a></li> <li>Also, we’ve updated the <a href="https://cml.dev/doc/self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">self-hosted runners</a> docs in CML.</li> <li>We've now added a guide for bringing your data to GitLab using DVC. Find the details <a href="https://cml.dev/doc/cml-with-dvc?tab=GitLab" target="_blank" rel="nofollow noopener noreferrer">in this doc.</a></li> </ul> <h3 id="mlem-docs" style="position:relative;">MLEM docs<a href="#mlem-docs" aria-label="mlem docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li><a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">MLEM docs</a> have received a nearly full overhaul.</li> <li>Additionally the <a href="https://mlem.ai/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get Started</a> section has been greatly improved.</li> <li>Look out for new docs to come out soon for <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> on the <a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">MLEM</a> website.</li> </ul> <h3 id="iterative-studio-docs" style="position:relative;">Iterative Studio docs<a href="#iterative-studio-docs" aria-label="iterative studio docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 
1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>Iterative Studio now supports adding a model from a remote location. Find out more <a href="https://dvc.org/doc/studio/user-guide/model-registry/add-a-model" target="_blank" rel="nofollow noopener noreferrer">here</a>.</li> <li>Use the new Iterative Studio Wizard to set up CML in your CI. More on the process and parameters <a href="https://dvc.org/doc/studio/user-guide/projects-and-experiments/run-experiments#use-the-iterative-studio-wizard-to-set-up-your-ci-action" target="_blank" rel="nofollow noopener noreferrer">here in the docs.</a></li> </ul> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/mlem-k8s-sagemakerhttps://dvc.org/blog/mlem-k8s-sagemakerMon, 31 Oct 2022 00:00:00 GMT<p>To deploy to cloud platforms, you have to learn how they work, their secrets, and their quirks. 
To simplify your daily Swiss-army-knife ML duties, you’ll need to write complicated bash scripts, figure out what arguments need to be supplied to the platform CLI tool or API methods, call them in the correct way, and take on the burden of extending your knowledge to yet another cloud platform tool.</p> <p>But it doesn’t always have to be that way. Some tools like Terraform help you with managing the infrastructure in a cloud-agnostic way, so why can’t we invent the same for MLOps?</p> <p>That’s why we’re releasing new Deployment mechanics for MLEM, along with 4 supported deployment targets: <a href="https://mlem.ai/doc/user-guide/deploying/docker" target="_blank" rel="nofollow noopener noreferrer">Docker container deploy</a>, <a href="https://mlem.ai/doc/user-guide/deploying/heroku" target="_blank" rel="nofollow noopener noreferrer">Heroku</a>, <a href="https://mlem.ai/doc/user-guide/deploying/kubernetes" target="_blank" rel="nofollow noopener noreferrer">Kubernetes</a>, and <a href="https://mlem.ai/doc/user-guide/deploying/sagemaker" target="_blank" rel="nofollow noopener noreferrer">AWS SageMaker</a>.</p> <p><img src="https://media.giphy.com/media/bfOb3UnSzQvTsBKLmq/giphy.gif" alt="Docker, Heroku, Kubernetes, and SageMaker, in person"></p> <h1 id="deploying-with-a-single-command" style="position:relative;">Deploying with a single command<a href="#deploying-with-a-single-command" aria-label="deploying with a single command permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>MLEM strives to abstract away all the stuff you need to do for 
deployment. Once you configure kubectl with your cluster IP and credentials, deploying your model is as simple as:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem deployment run</span> kubernetes app.mlem <span class="token punctuation">\</span> <span class="token parameter variable">--model</span> model <span class="token parameter variable">--service_type</span> loadbalancer </span>⏳️ Loading model from model.mlem 💾 Saving deployment to service_name.mlem 🛠 Creating docker image app 🛠 Building MLEM wheel file... 💼 Adding model files... 🛠 Generating dockerfile... 💼 Adding sources... 💼 Generating requirements file... 🛠 Building docker image app:4ee45dc33804b58ee2c7f2f6be447cda... ✅ Built docker image app:4ee45dc33804b58ee2c7f2f6be447cda namespace created. status='{'conditions': None, 'phase': 'Active'}' deployment created. status='{'available_replicas': None, 'collision_count': None, 'conditions': None, 'observed_generation': None, 'ready_replicas': None, 'replicas': None, 'unavailable_replicas': None, 'updated_replicas': None}' service created. status='{'conditions': None, 'load_balancer': {'ingress': None}}' ✅ Deployment app is up in mlem namespace</code></pre></div> <p>The <code>app.mlem</code> file stores all the information about the deployment that we specified. Later, we can use it to deploy a new model version.</p> <p>This created <code>deployment</code> and <code>service</code> resources in the cluster. 
Let’s check out pods that were created by the <code>deployment</code> (all the resources are placed in <code>mlem</code> namespace by default):</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">kubectl</span> get pods <span class="token parameter variable">--namespace</span> mlem </span>NAMESPACE NAME READY STATUS RESTARTS AGE mlem app-cddbcc89b-zkfhx 1/1 Running 0 5m58s</code></pre></div> <h1 id="getting-predictions" style="position:relative;">Getting predictions<a href="#getting-predictions" aria-label="getting predictions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Since our model above is reachable by HTTP request, we can just open the URL and see the OpenAPI spec there (like <a href="http://example-mlem-get-started-app.herokuapp.com/docs" target="_blank" rel="nofollow noopener noreferrer">this one</a>), or send requests to get predictions. 
We can also use built-in MLEM functionality to achieve the same:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem deployment apply</span> app.mlem data.csv <span class="token parameter variable">--json</span> </span>[0, 1, 2]</code></pre></div> <h1 id="extend-your-learning" style="position:relative;">Extend your learning<a href="#extend-your-learning" aria-label="extend your learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>That’s it: deployment to cloud providers is as simple as it can be. 
MLEM simplifies your daily routine and helps you focus on developing models instead of spending time in the DevOps weeds.</p> <ul> <li>To learn how MLEM can help you, try out the <a href="https://mlem.ai/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get Started Tutorial</a> or <a href="https://mlem.ai/doc/use-cases" target="_blank" rel="nofollow noopener noreferrer">Use Cases</a>.</li> <li>To see a full-scale Tutorial for Kubernetes, Sagemaker or Heroku, check out our <a href="https://mlem.ai/doc/user-guide" target="_blank" rel="nofollow noopener noreferrer">User Guide</a>.</li> <li>To quickly get your questions answered, reach us in <a href="https://discord.com/channels/485586884165107732/903647230655881226" target="_blank" rel="nofollow noopener noreferrer">Discord</a> or <a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">GitHub issues</a>.</li> </ul> <h1 id="whats-next" style="position:relative;">What’s next?<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>It’s been five months since we released MLEM on the 1st of June, and now it’s October 31st already. With all these big deployment targets, MLEM finally looks like a formidable little dog 🎃. What’s next on the agenda?</p> <ul> <li>We’re going to work on an <strong><a href="https://github.com/iterative/mlem/issues/454" target="_blank" rel="nofollow noopener noreferrer">e2e Computer Vision scenario</a></strong>. 
Think about training a NN to classify images, saving it with MLEM, and deploying it to K8s or Sagemaker.</li> <li>We are going to share how to use MLEM when your model <a href="https://github.com/iterative/mlem/issues/283" target="_blank" rel="nofollow noopener noreferrer">consists of two parts: <strong>preprocessing and inference</strong></a>.</li> <li>Batch processing is something we received many requests about. We’ll set up an example of how to use <a href="https://github.com/iterative/mlem/issues/11" target="_blank" rel="nofollow noopener noreferrer"><strong>MLEM with Airflow</strong></a> and publish it. 📚</li> </ul> <p>Happy to hear your thoughts on this!</p> <p>Machine Learning should be <del>mlemming</del> scary! Once a year only.</p> <p><img src="https://media.giphy.com/media/dlYIz2AoqR5GcqZ1Yk/giphy.gif" alt="Happy Halloween!"></p>https://dvc.org/blog/jupyter-notebook-dvc-pipelinehttps://dvc.org/blog/jupyter-notebook-dvc-pipelineMon, 24 Oct 2022 00:00:00 GMT<p>While every data scientist has their own methods and approaches to conducting data science, there is one tool that nearly everyone in the field uses: <a href="https://jupyter.org/" target="_blank" rel="nofollow noopener noreferrer">Jupyter Notebook</a>. Its ease of use makes it the perfect tool for prototyping, usually resulting in a script in which we preprocess the data, do a train/test split, train our model, and evaluate it.</p> <p>However, once we have a decent prototype, the subsequent iterations generally don’t touch most of the code. Instead, we tend to focus on tweaking feature engineering parameters and tuning model hyperparameters. At this point, we really start experimenting, trying to answer questions such as <em>“What happens if I increase the learning rate?”</em> and <em>“What’s the optimal batch size?”</em></p> <p>It will take numerous experiments to get to an acceptable level of performance for our model. 
But with so many experiments, it becomes difficult to keep track of the changes. In turn, this makes it difficult to go back in time to a certain point and see what combination of data, code, and parameters constituted a specific experiment. In other words, we cannot <em>reproduce</em> our experiments.</p> <admon type="info"> <p>Reproducibility is a core concept of our data science philosophy here at Iterative. If you are new to the concept, I recommend reading <a href="https://iterative.ai/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">this blog post by Dave Berenbaum</a> or <a href="https://neptune.ai/blog/how-to-solve-reproducibility-in-ml" target="_blank" rel="nofollow noopener noreferrer">this one by Ejiro Onoso</a>.</p> </admon> <p>We can solve our need for reproducibility by transforming our notebook into a codified pipeline with defined inputs and outputs. This will allow us to then save every experiment that modifies the inputs, pipeline, or outputs. In this guide, we will explore how to do this using <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>. 
It extends Git so that in addition to code and parameters we can track and version data and models.</p> <h2 id="what-well-be-doing" style="position:relative;">What we’ll be doing<a href="#what-well-be-doing" aria-label="what well be doing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>While a pipeline typically consists of multiple <em>stages</em>, transforming our notebook straight into a multi-stage DVC pipeline may seem somewhat daunting. For the sake of simplicity, we will create a pipeline with just one stage for now: run all of the code in our notebook. Just like any other pipeline, we will have defined inputs (data and parameters) and outputs (model, evaluation metrics, and plots).</p> <p>To achieve this, we will wrap our notebook with <a href="https://papermill.readthedocs.io/en/latest/usage-workflow.html" target="_blank" rel="nofollow noopener noreferrer">Papermill</a>. With this tool, we can parameterize our notebook and run experiments <a href="https://papermill.readthedocs.io/en/latest/usage-execute.html#execute-via-cli" target="_blank" rel="nofollow noopener noreferrer">from our CLI with a single command</a>.</p> <p>Throughout this guide, we will do the following:</p> <ol> <li>Parameterize a notebook using Papermill;</li> <li>Create a single-stage pipeline with DVC;</li> <li>Version our data, model, and other large artifacts using DVC; and</li> <li>Run multiple experiments using the new pipeline.</li> </ol> <p>As an example project, we will be using a notebook I created that trains a classifier for Pokémon sprites. 
You can find this project in <a href="https://github.com/iterative/example-pokemon-classifier/tree/snapshot-jupyter" target="_blank" rel="nofollow noopener noreferrer">the repository here</a>. Make sure to follow the instructions in <code>README.md</code> to set up the development environment and to <code>git checkout snapshot-jupyter</code> to get our starting point for this guide.</p> <p>Of course, you can also follow along using a notebook you created yourself! In that case, you will at least need to install <code>dvc</code> and <code>papermill</code>. You will also need to initialize DVC through <a href="https://dvc.org/doc/command-reference/init"><code>dvc init</code></a>.</p> <admon type="tip"> <p>If you're using <a href="https://code.visualstudio.com/" target="_blank" rel="nofollow noopener noreferrer">Visual Studio Code</a> as your IDE, I also recommend <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">installing the DVC extension</a>. This will make it even easier to run and compare experiments!</p> </admon> <h2 id="guide" style="position:relative;">Guide<a href="#guide" aria-label="guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Stages in a DVC pipeline consist of commands, just as we would run them in our own terminal. As such, we need a way to run the contents of our notebook from our command line. This is where Papermill comes in. 
With the following command we execute the entire notebook as a single unit without changing its contents:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ papermill <span class="token punctuation">\</span> notebooks/pokemon_classifier.ipynb <span class="token punctuation">\</span> outputs/completed_notebook.ipynb</code></pre></div> <p>The result is saved as a new notebook in <code>outputs/completed_notebook.ipynb</code>.</p> <h3 id="parameterize-notebook" style="position:relative;">Parameterize notebook<a href="#parameterize-notebook" aria-label="parameterize notebook permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>While we would technically have a DVC pipeline if we added this command as a stage, its usefulness would be somewhat limited. After all, the result would be the same every time we execute the command. To start experimenting with our pipeline, we need to parameterize our notebook. 
We do so by creating a single cell at the top of our notebook where we declare the parameters:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">SEED<span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token operator">=</span> <span class="token number">42</span> POKEMON_TYPE_TRAIN<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"Water"</span> SOURCE_DIRECTORY<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"data/external"</span> DESTINATION_DIRECTORY<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"data/processed"</span> TRAIN_DATA_IMAGES<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"images-gen-1-8"</span> TRAIN_DATA_LABELS<span class="token punctuation">:</span> <span class="token builtin">str</span> <span class="token operator">=</span> <span class="token string">"stats/pokemon-gen-1-8.csv"</span> MODEL_TEST_SIZE<span class="token punctuation">:</span> <span class="token builtin">float</span> <span class="token operator">=</span> <span class="token number">0.2</span> MODEL_LEARNING_RATE<span class="token punctuation">:</span> <span class="token builtin">float</span> <span class="token operator">=</span> <span class="token number">0.001</span> MODEL_EPOCHS<span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token operator">=</span> <span class="token number">10</span> MODEL_BATCH_SIZE<span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token operator">=</span> <span class="token 
number">120</span></code></pre></div> <p>Papermill <a href="https://papermill.readthedocs.io/en/latest/usage-parameterize.html#designate-parameters-for-a-cell" target="_blank" rel="nofollow noopener noreferrer">needs a <code>parameters</code> tag</a> to recognize this cell as the one containing our parameters. To add this tag to the cell, we go to <code>View / Cell Toolbar</code> and enable <code>Tags</code>. Afterward, we type in <code>parameters</code> in the top right corner of our cell.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 491px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/44511a7489df0e34ed81cc441214f754/5c810/jupyter-tags.png" alt="Enabling Tags for Jupyter Notebook cells" title="Enabling Tags for Jupyter Notebook cells" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Enabling Tags for Jupyter Notebook cells</em></p> <admon type="tip"> <p>In case you’re running the notebook straight from VS Code, please be aware that <a href="https://github.com/microsoft/vscode-jupyter-powertoys/issues/48" target="_blank" rel="nofollow noopener noreferrer">editing cell tags is not natively supported here</a>. You can use the <a href="https://marketplace.visualstudio.com/items?itemName=ms-toolsai.vscode-jupyter-cell-tags" target="_blank" rel="nofollow noopener noreferrer">Jupyter Cell Tags extension</a> or the editor in Jupyter Server as shown above.</p> </admon> <p>We can now replace hard-coded parameters in our notebook with references to the variables we defined. 
For example, we change the following section of code like so:</p> <div class="gatsby-highlight" data-language="diff"><pre class="language-diff-python"><code class="language-diff-python">estimator = model.fit(X_train, y_train, <span class="token unchanged language-python"><span class="token prefix unchanged"> </span> validation_data <span class="token operator">=</span> <span class="token punctuation">(</span>X_test<span class="token punctuation">,</span> y_test<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token prefix unchanged"> </span> class_weight <span class="token operator">=</span> calculate_class_weights<span class="token punctuation">(</span>y_train<span class="token punctuation">)</span><span class="token punctuation">,</span> </span><span class="token deleted-sign deleted language-python"><span class="token prefix deleted">-</span> epochs <span class="token operator">=</span> <span class="token number">10</span><span class="token punctuation">,</span> </span><span class="token inserted-sign inserted language-python"><span class="token prefix inserted">+</span> epochs <span class="token operator">=</span> MODEL_EPOCHS<span class="token punctuation">,</span> </span><span class="token deleted-sign deleted language-python"><span class="token prefix deleted">-</span> batch_size <span class="token operator">=</span> <span class="token number">120</span><span class="token punctuation">,</span> </span><span class="token inserted-sign inserted language-python"><span class="token prefix inserted">+</span> batch_size <span class="token operator">=</span> MODEL_BATCH_SIZE<span class="token punctuation">,</span> </span><span class="token unchanged language-python"><span class="token prefix unchanged"> </span> verbose <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">)</span></span></code></pre></div> <p>Now we can run our notebook through Papermill with changed 
parameters:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ papermill <span class="token punctuation">\</span> notebooks/pokemon_classifier.ipynb <span class="token punctuation">\</span> outputs/completed_notebook.ipynb <span class="token punctuation">\</span> <span class="token parameter variable">-p</span> MODEL_EPOCHS <span class="token number">15</span> <span class="token punctuation">\</span> <span class="token parameter variable">-p</span> MODEL_BATCH_SIZE <span class="token number">160</span></code></pre></div> <h3 id="create-dvc-pipeline" style="position:relative;">Create DVC pipeline<a href="#create-dvc-pipeline" aria-label="create dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>With our parameterized notebook in place, we can create our pipeline with DVC. Our pipeline consists of stages (in this case: one stage) and has inputs and outputs. For our model, the inputs will be the required datasets and our notebook. The pipeline’s outputs will be the model itself, a graph showing the training process, and a confusion matrix for the model’s predictions.</p> <p>Additionally, a pipeline can have metrics and plots. 
We will define several metrics that allow us to compare model performance across different experiments, such as accuracy and F1 scores.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f6c1b1df7a76c455086a0ebc527b7c66/39600/pipeline-components.png" alt="All of the pipeline components" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Our inputs, pipeline, and outputs</em></p> <p>A DVC pipeline is defined in a dedicated <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. We can add stages manually in this file, which you generally want to do when building complex, multi-stage pipelines. However, to get started, it’s probably easier if we use the <a href="https://dvc.org/doc/command-reference/stage/add"><code>dvc stage add</code></a> command. We use the <code>-n</code> option to provide a name for the stage, the <code>-d</code> option to specify our dependencies, the <code>-o</code> option to specify our outputs, and the <code>-M</code> option to specify our metrics file. 
Lastly, we type in the command that DVC should execute for that stage:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc stage add</span> <span class="token punctuation">\</span> <span class="token parameter variable">-n</span> run_notebook <span class="token punctuation">\</span> <span class="token parameter variable">-d</span> notebooks/pokemon_classifier.ipynb <span class="token punctuation">\</span> <span class="token parameter variable">-d</span> data/external/images-gen-1-8 <span class="token punctuation">\</span> <span class="token parameter variable">-d</span> data/external/stats/pokemon-gen-1-8.csv <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> data/processed/pokemon <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> data/processed/pokemon.csv <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> data/processed/pokemon-with-image-paths.csv <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> outputs/model.pckl <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> outputs/confusion_matrix.png <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> outputs/train_history.png <span class="token punctuation">\</span> <span class="token parameter variable">-M</span> outputs/metrics.yaml <span class="token punctuation">\</span> papermill notebooks/pokemon_classifier.ipynb outputs/pokemon_classifier_out.ipynb</span></code></pre></div> <p>The uppercase <code>-M</code> option (as opposed to the lowercase <code>-m</code> option) tells DVC not to cache the resulting metrics file. 
We typically want to do this with metrics because the files are small enough to be tracked by Git directly.</p> <p>The resulting <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> looks as follows:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">run_notebook</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token scalar string"> papermill notebooks/pokemon_classifier.ipynb outputs/pokemon_classifier_out.ipynb</span> <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> notebooks/pokemon_classifier.ipynb <span class="token punctuation">-</span> data/external/images<span class="token punctuation">-</span>gen<span class="token punctuation">-</span>1<span class="token punctuation">-</span><span class="token number">8</span> <span class="token punctuation">-</span> data/external/stats/pokemon<span class="token punctuation">-</span>gen<span class="token punctuation">-</span>1<span class="token punctuation">-</span>8.csv <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data/processed/pokemon <span class="token punctuation">-</span> data/processed/pokemon.csv <span class="token punctuation">-</span> data/processed/pokemon<span class="token punctuation">-</span>with<span class="token punctuation">-</span>image<span class="token punctuation">-</span>paths.csv <span class="token punctuation">-</span> outputs/model.pckl <span class="token punctuation">-</span> outputs/confusion_matrix.png <span class="token punctuation">-</span> outputs/train_history.png <span class="token key 
atrule">metrics</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">outputs/metrics.yaml</span><span class="token punctuation">:</span> <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span></code></pre></div> <p>With that, we have our pipeline in its basic form! We can run the pipeline with the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command, and DVC will execute our notebook. We have yet to specify our parameters, however; until we do, every pipeline <em>run</em> will use the default parameters we defined in our notebook.</p> <p>DVC reads the values for parameters from another YAML file: <code>params.yaml</code>. We can declare the same parameters here that we previously incorporated in our notebook. To provide a little bit of structure, let’s also group them in sections:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">base</span><span class="token punctuation">:</span> <span class="token key atrule">seed</span><span class="token punctuation">:</span> <span class="token number">42</span> <span class="token key atrule">pokemon_type_train</span><span class="token punctuation">:</span> <span class="token string">'Water'</span> <span class="token key atrule">data_preprocess</span><span class="token punctuation">:</span> <span class="token key atrule">source_directory</span><span class="token punctuation">:</span> <span class="token string">'data/external'</span> <span class="token key atrule">destination_directory</span><span class="token punctuation">:</span> <span class="token string">'data/processed'</span> <span class="token key atrule">dataset_labels</span><span class="token punctuation">:</span> <span class="token string">'stats/pokemon-gen-1-8.csv'</span> <span class="token key 
atrule">dataset_images</span><span class="token punctuation">:</span> <span class="token string">'images-gen-1-8'</span> <span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">test_size</span><span class="token punctuation">:</span> <span class="token number">0.2</span> <span class="token key atrule">learning_rate</span><span class="token punctuation">:</span> <span class="token number">0.001</span> <span class="token key atrule">epochs</span><span class="token punctuation">:</span> <span class="token number">15</span> <span class="token key atrule">batch_size</span><span class="token punctuation">:</span> <span class="token number">120</span></code></pre></div> <p>We can now update our pipeline in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> to read the parameters from <code>params.yaml</code>. DVC detects the file automatically, and we can pass its values to the <code>papermill</code> command with the <code>-p</code> option. 
The result will look like this:</p> <div class="gatsby-highlight" data-language="diff"><pre class="language-diff-yaml"><code class="language-diff-yaml">stages: <span class="token unchanged language-yaml"><span class="token prefix unchanged"> </span> <span class="token key atrule">run_notebook</span><span class="token punctuation">:</span> <span class="token prefix unchanged"> </span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token scalar string"> <span class="token prefix unchanged"> </span> papermill <span class="token prefix unchanged"> </span> notebooks/pokemon_classifier.ipynb <span class="token prefix unchanged"> </span> outputs/pokemon_classifier_out.ipynb</span> </span><span class="token inserted-sign inserted language-yaml"><span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p SEED $<span class="token punctuation">{</span>base.seed<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p POKEMON_TYPE_TRAIN $<span class="token punctuation">{</span>base.pokemon_type_train<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p SOURCE_DIRECTORY $<span class="token punctuation">{</span>data_preprocess.source_directory<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p DESTINATION_DIRECTORY $<span class="token punctuation">{</span>data_preprocess.destination_directory<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p TRAIN_DATA_IMAGES $<span class="token punctuation">{</span>data_preprocess.dataset_images<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p TRAIN_DATA_LABELS 
$<span class="token punctuation">{</span>data_preprocess.dataset_labels<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p MODEL_TEST_SIZE $<span class="token punctuation">{</span>train.test_size<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p MODEL_LEARNING_RATE $<span class="token punctuation">{</span>train.learning_rate<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p MODEL_EPOCHS $<span class="token punctuation">{</span>train.epochs<span class="token punctuation">}</span> <span class="token prefix inserted">+</span> <span class="token punctuation">-</span>p MODEL_BATCH_SIZE $<span class="token punctuation">{</span>train.batch_size<span class="token punctuation">}</span> </span><span class="token unchanged language-yaml"><span class="token prefix unchanged"> </span> <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> notebooks/pokemon_classifier.ipynb <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> data/external/images<span class="token punctuation">-</span>gen<span class="token punctuation">-</span>1<span class="token punctuation">-</span><span class="token number">8</span> <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> data/external/stats/pokemon<span class="token punctuation">-</span>gen<span class="token punctuation">-</span>1<span class="token punctuation">-</span>8.csv <span class="token prefix unchanged"> </span> <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> data/processed/pokemon <span class="token prefix 
unchanged"> </span> <span class="token punctuation">-</span> data/processed/pokemon.csv <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> data/processed/pokemon<span class="token punctuation">-</span>with<span class="token punctuation">-</span>image<span class="token punctuation">-</span>paths.csv <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> outputs/model.pckl <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> outputs/confusion_matrix.png <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> outputs/train_history.png <span class="token prefix unchanged"> </span> <span class="token key atrule">metrics</span><span class="token punctuation">:</span> <span class="token prefix unchanged"> </span> <span class="token punctuation">-</span> <span class="token key atrule">outputs/metrics.yaml</span><span class="token punctuation">:</span> <span class="token prefix unchanged"> </span> <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span></span></code></pre></div> <p>And with that, we have our pipeline ready for use! 
Before we start running experiments with it, however, let’s ensure everything is tracked and versioned properly so we can reproduce our experiments later on.</p> <h3 id="version-our-data-models-and-plots-with-dvc" style="position:relative;">Version our data, models, and plots with DVC<a href="#version-our-data-models-and-plots-with-dvc" aria-label="version our data models and plots with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>As we discussed earlier, we want to version every component of our experiments to achieve true reproducibility: code, parameters, data, models, metrics, and plots. We want to version small files (usually text) with Git and larger files with DVC. That principle gives us the following split between the two:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2dbaf41ab7b1a61e962fd0a331d57002/39600/versioning-components.png" alt="Versioning all of the pipeline components" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Every component of our experiment is versioned either by Git or DVC</em></p> <p>When we created our pipeline in the previous step, DVC automatically started tracking the outputs we defined and listed them in our <code>.gitignore</code>. The metrics file, on the other hand, is not cached by DVC and remains tracked by Git (<code>cache: false</code>), because we added it with the uppercase <code>-M</code> option. 
If we wanted to track the metrics with DVC as well, we could change this in our <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p> <p>There is one last output of the pipeline we haven't yet accounted for: the executed notebook that Papermill writes to <code>outputs/pokemon_classifier_out.ipynb</code>. Because it's a rather large file that we don't really need for anything, we can simply add it to our <code>.gitignore</code>. After all, we can always reproduce it by rerunning our pipeline!</p> <p>With that, every component (of importance) in our project is now versioned by Git or DVC. That means we now have the reproducible pipeline we set out to create: we can go back to any experiment and get the precise combination of code, data, parameters, and results. This will make it much easier to conduct experiments, find the best-performing model, and collaborate with teammates.</p> <p>Let’s take our pipeline for a ride and run some experiments!</p> <admon type="info"> <p>At this point, we would typically also configure our DVC remote to make sure our versioning exists not only on our local system. 
This is outside the scope of this guide, but you can find guides for <a href="https://iterative.ai/blog/using-gcp-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Google Cloud Platform</a>, <a href="https://iterative.ai/blog/azure-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Azure Blob Storage</a>, and <a href="https://iterative.ai/blog/aws-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Amazon Web Services</a> on our blog.</p> </admon> <h3 id="running-experiments" style="position:relative;">Running experiments<a href="#running-experiments" aria-label="running experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are two ways in which we can run experiments with our newly defined pipeline. The first one utilizes our good ol’ command line interface. We can use <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> to run an experiment after we have changed the parameters in <code>params.yaml</code>, or we could change the parameters in the command itself with the <code>-S</code> option. 
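</p> <p>The <code>-S</code> option addresses a parameter by its dotted path in <code>params.yaml</code>. Here is a minimal sketch of such a file, assuming a nested <code>train.epochs</code> key; the value shown is a hypothetical placeholder.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"># params.yaml (sketch; the value is hypothetical)
train:
  epochs: 10    # addressed on the command line as train.epochs
</code></pre></div> <p>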
For example, the following command triggers a new experiment with an updated number of epochs:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'train.epochs=25'</span></span></code></pre></div> <p>Alternatively, if we’re using <a href="https://code.visualstudio.com/" target="_blank" rel="nofollow noopener noreferrer">Visual Studio Code</a> as our IDE of choice, we can use <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">the DVC extension</a> to run and visualize experiments through a graphical user interface. We can go to the experiment table and, from there, modify, queue, and run new experiments. The results are listed one below the other, providing an easy way to compare their outcomes.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-10-24/dvc-vscode-extension-3d9c6b635560d7ec3f20532230bca57d.mp4" type="video/mp4"> Your browser does not support the video tag. </video></p> <p>Now, all that’s left to do is start experimenting and find the best possible model! 
Once we have drawn our conclusions from experimenting, we can pick the best-performing experiment and start using the model it put forth.</p> <h2 id="conclusions" style="position:relative;">Conclusions<a href="#conclusions" aria-label="conclusions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Throughout this guide, we transformed a Jupyter Notebook into a codified pipeline for reproducible experiments. We used Papermill to parameterize our notebook so that we could run it with a single command and then created a pipeline in DVC to run that command for us.</p> <admon type="info"> <p>The result of following the guide can be found in <a href="https://github.com/iterative/example-pokemon-classifier/tree/papermill-dvc" target="_blank" rel="nofollow noopener noreferrer">the <code>papermill-dvc</code> branch of the example project</a>.</p> </admon> <p>With our DVC pipeline tracking and versioning every experiment, we can discover which combination of code, data, and parameters provides the best results. Comparing experiments is especially easy when using the experiment table in the DVC extension for Visual Studio Code.</p> <p>From this point onwards, we can still make a few improvements to our pipeline. For one, we could leverage DVC to generate our plots rather than render them as images from our notebook. This would allow us to compare experiments visually in a similar manner to how DVC can visualize an experiments table. 
To learn more about this, <a href="https://dvc.org/doc/command-reference/plots" target="_blank" rel="nofollow noopener noreferrer">please refer to the docs</a>.</p> <p>Another improvement would be to break up our single-stage pipeline into different stages with coherent units of code (e.g., preprocess, train, and evaluate). Our current implementation runs the entire notebook for every single experiment, even though the data preprocessing doesn’t change between experiments. With a multi-stage pipeline, DVC could track changes to the in- and outputs for every stage and automatically determine which stages it can skip because nothing has changed. This saves time and resources, especially in computationally heavy projects.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc dag</span> </span>+-------------------+ | data/external.dvc | +-------------------+ * * * +-----------------+ | data_preprocess | +-----------------+ * * * +-----------+ | data_load | +-----------+ * * * +-------+ | train | +-------+ * * * +----------+ | evaluate | +----------+</code></pre></div> <p>If you want to learn how to transform a notebook into a multi-stage pipeline, I recommend taking a look at our course: <a href="https://learn.dvc.org/course/data-scientist-path" target="_blank" rel="nofollow noopener noreferrer">Iterative tools for Data Scientists and Analysts</a>. It is completely free to follow, and module 3 covers this process in depth.</p> <p>We might also write a future guide about this, so let us know if you would be interested in seeing this content. 
Make sure to join <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">our Discord server</a> if you have any questions or want to discuss this post further!</p>https://dvc.org/blog/october-heartbeathttps://dvc.org/blog/october-heartbeatThu, 20 Oct 2022 00:00:00 GMT<p>Welcome to October! As the days grow shorter or longer depending on your hemisphere, we bring you the latest and greatest from the Iterative Community.</p> <h1 id="in-ai-news" style="position:relative;">In AI News<a href="#in-ai-news" aria-label="in ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="andrew-ng-at-intels-innovation-conference---democratizing-ai-through-data-centric-ai" style="position:relative;">Andrew Ng at Intel's Innovation Conference - Democratizing AI through Data-Centric AI<a href="#andrew-ng-at-intels-innovation-conference---democratizing-ai-through-data-centric-ai" aria-label="andrew ng at intels innovation conference democratizing ai through data centric ai permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <div class="yt-embed-wrapper"><iframe width="100%" height="315" 
src="https://www.youtube-nocookie.com/embed/G3MaIMrR6Ms?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>At <a href="https://www.intel.com/content/www/us/en/events/on-event-series/innovation.html" target="_blank" rel="nofollow noopener noreferrer">Intel’s Innovation</a> conference, <a href="https://www.linkedin.com/in/andrewyng/" target="_blank" rel="nofollow noopener noreferrer"><strong>Andrew Ng</strong></a> gave a keynote on democratizing AI. He posits that while large companies have embraced AI, most smaller companies outside of the consumer-based domains still struggle. He provides two main reasons for this: small datasets and customization.</p> <p>According to Ng, data-centric AI will be the key to unlocking that potential, forcing a paradigm shift away from code-centric AI. In this scenario, people could take mostly ready-built ML tech and focus on the data to ensure it captures all necessary domain knowledge.</p> <p>For example, two companies that produce cornflakes and medication could take the same ML model and train it on their respective datasets. As long as they have the right tools and practices and provide a domain representative dataset, the same model can reproduce effective results. If you want to see some of the tools Ng uses, make sure to check out his keynote.</p> <p>What do you think? Does the average data scientist need a different set of skills in the near future? Are you in one of these smaller industries that are starting to embrace AI? We'd love to read your thoughts! 
Join us in our <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">discussion of this topic on Discord</a>!</p> <h2 id="blueprint-for-an-ai-bill-of-rights" style="position:relative;">Blueprint for an AI Bill of Rights<a href="#blueprint-for-an-ai-bill-of-rights" aria-label="blueprint for an ai bill of rights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/826ad49e017a3af5984d9c6cf494e987/39600/blue-print.png" alt="Blueprint for an AI Bill of Rights" title="White House Blueprint for an AI Bill of Rights" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> As you may recall, in <a href="https://iterative.ai/blog/september-22-heartbeat#european-ai-act" target="_blank" rel="nofollow noopener noreferrer">last month's Heartbeat</a> we called your attention to the EU AI Act. This act proposes new rules that would require open source developers to adhere to guidelines across a spectrum of categories, including risk management, data governance, technical documentation and transparency, standards and accuracy, and cybersecurity. Not to be outdone, the US White House released a <a href="https://www.whitehouse.gov/ostp/ai-bill-of-rights/" target="_blank" rel="nofollow noopener noreferrer">Blueprint for an AI Bill of Rights</a>. 
<a href="https://www.whitehouse.gov/ostp/" target="_blank" rel="nofollow noopener noreferrer">The White House Office of Science and Technology Policy (OSTP)</a> has defined 5 categories for these rights:</p> <ol> <li>Safe and Effective Systems</li> <li>Algorithmic Discrimination Protection</li> <li>Data Privacy</li> <li>Notice and Explanation</li> <li>Human Alternatives, Consideration, and Fallback</li> </ol> <p>There's definitely some overlap here with the EU AI Act and some catching up with Data Privacy in the mix. There's lots to unpack, compare, and contrast on scope and philosophy between the two. It's nice to see that major attention is given to these issues.</p> <p>We could think of the relationship between AI rights and Andrew Ng's talk in the sense of the AI space maturing. To Andrew Ng's points, as we move from the frenzied all-important model development to an understanding of the need for a data-centric approach and this democratization, we are changing the focus to enable us to adequately address these hard and important issues. Improving the efficiency of tooling will help with this too. That's why we are here.</p> <p>What do you think? Do the efficiencies we are gaining open up room for improved time/attention to bake protections into the process or am I too hopeful? 
Head to <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> and share your thoughts!</p> <h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e798beca6e65dd684e680b7d07318b57/03346/hydra.jpg" alt="DVC-Hydra integration" title="DVC-Hydra integration" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>AI generated image of rainbow feathered dragon (DeeVee + Hydra)</em></p> <h2 id="dvc-hydra-integration" style="position:relative;">DVC-Hydra Integration<a href="#dvc-hydra-integration" aria-label="dvc hydra integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Did you hear? DVC has a new integration with Hydra. Now you can use Hydra composition to configure your DVC experiments. 
You can also append and remove parameters on the fly, as well as run a grid search over parameters. Random search functionality is coming; <a href="https://github.com/iterative/dvc/issues/8258" target="_blank" rel="nofollow noopener noreferrer">weigh in on the issue here.</a> Find out more in <a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia's</strong></a> <a href="https://iterative.ai/blog/dvc-hydra-integration" target="_blank" rel="nofollow noopener noreferrer">blog post</a>.</p> <h2 id="october-meetup" style="position:relative;">October Meetup<a href="#october-meetup" aria-label="october meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you missed the October Meetup with <a href="https://www.linkedin.com/in/nadia-nahar-iit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nadia Nahar</strong></a> presenting her team's research on <em>Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process</em>, don't worry, there's a video! 
Catch it below!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/FKdVSNfnD_M?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="november-meetup" style="position:relative;">November Meetup<a href="#november-meetup" aria-label="november meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Join us for our next meetup on November 16th. We will have <a href="https://www.linkedin.com/in/dim25/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmytro Filatov</strong></a> of <a href="https://deepxhub.com/" target="_blank" rel="nofollow noopener noreferrer">DeepX</a> presenting <em>Continuous Computer Vision with DVC and CML</em> and <a href="https://www.linkedin.com/in/jelle-bouwman/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jelle Bouwman</strong></a> demoing the Iterative Studio Model Registry. 
Be sure to register <a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289088542/" target="_blank" rel="nofollow noopener noreferrer">here!</a></p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/events/289088542/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Continuous Computer Vision with DVC and CML plus Iterative Studio Model Registry Demo</h4> <div class="elp-description">Join us on November 16th. Come see the possibilities with DVC, CML, and Iterative Studio Model Registry!</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-10-20/meetup-6b29c88388fd183f67d88dad40d5c671.png" alt="Continuous Computer Vision with DVC and CML plus Iterative Studio Model Registry Demo"> </div> </a> </section> <p></p> <h2 id="alex-kim---cicd-for-machine-learning-webinar-with-odsc" style="position:relative;">Alex Kim - CI/CD for Machine Learning Webinar with ODSC<a href="#alex-kim---cicd-for-machine-learning-webinar-with-odsc" aria-label="alex kim cicd for machine learning webinar with odsc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Join <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> on November 30th with <a href="https://opendatascience.com/" target="_blank" rel="nofollow noopener noreferrer">ODSC</a> to learn about 
CI/CD for Machine Learning. This webinar shares how CML is a project to help ML and data science practitioners automate their ML model training and model evaluation, using best practices and tools from software engineering, such as GitLab CI/CD (as well as GitHub Actions and BitBucket Pipelines). The idea is to automatically train your model and test it in a production-like environment every time your data or code changes. In this talk, you'll learn how to:</p> <ul> <li>Automatically allocate cloud instances (AWS, Azure, GCP) to train ML models. And automatically shut the instance down when training is over</li> <li>Automatically generate reports with graphs and tables in pull/merge requests to summarize your model's performance, using any visualization library</li> <li>Transfer data between cloud storage and computing instances with DVC</li> <li>Customize your automation workflow with GitLab CI/CD</li> </ul> <p>Sign up for the talk <a href="https://register.gotowebinar.com/register/6817359546805649932?utm_campaign=Webinars&utm_source=Community&utm_medium=Community&utm_content=Webinar%2030th%20Nov%202022" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e98308d2e1d9eae00586b4b24266e708/39600/alex-kim.png" alt="Alex Kim ODSC webinar" title="Alex Kim ODSC webinar" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Alex Kim webinar CI/CD for Machine Learning for ODSC (<a href="https://register.gotowebinar.com/register/6817359546805649932?utm_campaign=Webinars&utm_source=Community&utm_medium=Community&utm_content=Webinar%2030th%20Nov%202022" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="its-hacktoberfest" style="position:relative;">It's Hacktoberfest!<a 
href="#its-hacktoberfest" aria-label="its hacktoberfest permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 200px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/47bd367f4f623fc9f46c4ba7fc706e51/39600/hacktoberfest.png" alt="Iterative Hacktoberfest" title="Iterative Hacktoberfest" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> It's Hacktoberfest month and we are participating! Find out all the information in <a href="https://twitter.com/mertbozkirr" target="_blank" rel="nofollow noopener noreferrer"><strong>Mert Bozkir's</strong></a> <a href="https://iterative.ai/blog/iterative-x-hacktoberfest-2022" target="_blank" rel="nofollow noopener noreferrer">blog post</a>. 
But if you just want to jump in, find all the open Hacktoberfest issues <a href="https://github.com/search?o=desc&q=org%3Aiterative+label%3Ahacktoberfest&s=comments&state=open&type=Issues" target="_blank" rel="nofollow noopener noreferrer">here.</a> Follow along in the <code>#hacktoberfest</code> channel in Discord to keep up to date for the rest of the month, and be sure to read next month's Heartbeat to learn about the contributions!</p> <h2 id="new-hires" style="position:relative;">New Hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/ivan-longin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ivan Longin</strong></a> joins us as a Senior Software Engineer on the Iterative Studio team from Zadar, Croatia. When Ivan's not working, he likes to spend time doing outdoor activities, swimming in good weather, or just walking (or, often, running) after his one-year-old! Been there three times over! 
❤️ Welcome Ivan!</p> <h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>This month was full of great content. We wanted to give a shout-out to all of it, so we are trying out a more abbreviated list.<br> Thanks to all these amazing Community members that are sharing their knowledge! 🚀</p> <h2 id="dvc" style="position:relative;">DVC<a href="#dvc" aria-label="dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="data-management" style="position:relative;">Data management<a href="#data-management" aria-label="data management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li><a 
href="https://towardsdatascience.com/data-and-machine-learning-model-versioning-with-dvc-34fdadd06b15" target="_blank" rel="nofollow noopener noreferrer">Data and Machine Learning Model Versioning with DVC</a> by <a href="https://www.linkedin.com/in/marcellusrubenwinastwan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ruben Winastwan</strong></a> Nice visuals! ⭐️</li> <li>A great guide from <a href="https://www.linkedin.com/in/wmeints/" target="_blank" rel="nofollow noopener noreferrer"><strong>Willem Meints</strong></a> - <a href="https://fizzylogic.nl/2022/10/14/managing-machine-learning-datasets-with-dvc" target="_blank" rel="nofollow noopener noreferrer">Managing Machine Learning Datasets with DVC.</a> Also, find his <a href="https://twitter.com/willem_meints/status/1580898467097980932?s=20&t=SD8k9hZ7ygzEFlGBNTyJSA" target="_blank" rel="nofollow noopener noreferrer">Tweets on Twitter</a></li> <li><a href="https://www.linkedin.com/in/jorgehabibnamour/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jorge Namour</strong></a> will give a Webinar on <a href="https://www.facebook.com/facet.unt/posts/pfbid03ABqt5v1tUhRJJowSZgvjaYdFYfyirxGu9aph6LstYu8rVPJsYeuTBPio9srMn4hl" target="_blank" rel="nofollow noopener noreferrer">Tracking Data with Git + DVC</a> en Español on October 27th <a href="https://www.youtube.com/watch?v=pYLEf9FsFic" target="_blank" rel="nofollow noopener noreferrer">at this YouTube link.</a></li> <li>Some GitHub goodness: <a href="https://github.com/datarootsio/tutorial-mlops" target="_blank" rel="nofollow noopener noreferrer">MLOps - tutorial with DVC, MLFlow, and Pycaret</a> from <a href="https://github.com/murilo-cunha" target="_blank" rel="nofollow noopener noreferrer"><strong>Murilo Cunha</strong></a>, <a href="https://github.com/vspara" target="_blank" rel="nofollow noopener noreferrer"><strong>vspara</strong></a>, and <a href="https://github.com/virginiemar" target="_blank" rel="nofollow noopener 
noreferrer"><strong>virginiemar</strong></a></li> <li>Updated Udemy course that includes DVC - <a href="https://www.udemy.com/course/complete-mlops-bootcamp-from-zero-to-hero-in-python-2022/?utm_source=aff-campaign&utm_medium=udemyads&LSNPUBID=McqLy3Lfq44&ranMID=47901&ranEAID=McqLy3Lfq44&ranSiteID=McqLy3Lfq44-MTrInsWY4oEt0kDxUzExAg" target="_blank" rel="nofollow noopener noreferrer">Complete MLOps Bootcamp | From Zero to Hero in Python 2022</a></li> <li><a href="https://mathdatasimplified.com/2022/10/07/how-to-version-control-your-data-and-models-with-dvc/?utm_source=rss&utm_medium=rss&utm_campaign=how-to-version-control-your-data-and-models-with-dvc" target="_blank" rel="nofollow noopener noreferrer">How to Version Control Your Data and Models with DVC</a> (<strong>Video included</strong>) by <a href="https://www.linkedin.com/in/khuyen-tran-1401/" target="_blank" rel="nofollow noopener noreferrer"><strong>Khuyen Tran</strong></a> Dig the DVC color-themed command line! 🤩</li> <li>NLP and CV with DVC! 
<a href="https://pub.towardsai.net/from-unet-to-bert-extraction-of-important-information-from-scientific-papers-ef0f737e45e9" target="_blank" rel="nofollow noopener noreferrer">From UNet to BERT: Extraction of Important Information from Scientific Papers</a> by <a href="https://www.linkedin.com/in/eman-shemsu-83473684/" target="_blank" rel="nofollow noopener noreferrer"><strong>Eman Shemsu</strong></a></li> <li><a href="https://minimin2.tistory.com/m/185" target="_blank" rel="nofollow noopener noreferrer">[MLOps] How to use DVC (Data Version Control) data versioning</a> in Korean 🇰🇷 by Minimin2</li> </ul> <h3 id="data-pipelines" style="position:relative;">Data Pipelines<a href="#data-pipelines" aria-label="data pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>Great guide from <a href="https://www.linkedin.com/in/deborahmesquita/" target="_blank" rel="nofollow noopener noreferrer"><strong>Déborah Mesquita</strong></a> - <a href="https://towardsdatascience.com/the-ultimate-guide-to-building-maintainable-machine-learning-pipelines-using-dvc-a976907b2a1b" target="_blank" rel="nofollow noopener noreferrer">The ultimate guide to building maintainable Machine Learning pipelines using DVC</a> (<strong>Video Included</strong>) ⭐️</li> <li>Also from <a href="https://www.linkedin.com/in/khuyen-tran-1401/" target="_blank" rel="nofollow noopener noreferrer"><strong>Khuyen Tran</strong></a>: <a href="https://towardsdatascience.com/create-a-maintainable-data-pipeline-with-prefect-and-dvc-1d691ea5bcea" target="_blank" rel="nofollow noopener 
noreferrer">Create a Maintainable Data Pipeline with Prefect and DVC</a></li> </ul> <h3 id="experimentation" style="position:relative;">Experimentation<a href="#experimentation" aria-label="experimentation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>In-depth tutorial covering Data Management, Pipelines and Experimentation with DVC from <a href="https://www.linkedin.com/in/givashkevich/" target="_blank" rel="nofollow noopener noreferrer"><strong>Gleb Ivashkevich</strong></a> - <a href="https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b" target="_blank" rel="nofollow noopener noreferrer">Creating Reproducible Data Science Workflows with DVC</a> ⭐️</li> <li><a href="https://iblog.ridge-i.com/entry/2022/10/11/102033" target="_blank" rel="nofollow noopener noreferrer">Data Version Control (DVC): Beginner's Guide</a> by <a href="https://www.linkedin.com/in/ajmain-inqiad-alam/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ajmain Inqiad Alam</strong></a></li> </ul> <h3 id="other-mentions" style="position:relative;">Other mentions<a href="#other-mentions" aria-label="other mentions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 
3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>There is now a <a href="https://en.wikipedia.org/w/index.php?title=Data_Version_Control&diff=1114227867&oldid=1114227707" target="_blank" rel="nofollow noopener noreferrer"><strong>DVC Wikipedia page!</strong></a></li> <li>Great discussion around challenges in Machine learning from <a href="https://medium.com/@dvsamchuk" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmytro Samchuk</strong></a> - <a href="https://medium.com/@dvsamchuk/machine-learning-done-right-in-your-business-130acd3a093e" target="_blank" rel="nofollow noopener noreferrer">Machine Learning Done Right in Your Business.</a></li> </ul> <h2 id="cml" style="position:relative;">CML<a href="#cml" aria-label="cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li>CML in research! 
🤩 <a href="https://arxiv.org/abs/2209.11453" target="_blank" rel="nofollow noopener noreferrer">A Preliminary Investigation of MLOps Practices in GitHub</a>, <a href="https://arxiv.org/pdf/2209.11453.pdf" target="_blank" rel="nofollow noopener noreferrer">PDF</a> by <a href="https://www.linkedin.com/in/fcalefato/" target="_blank" rel="nofollow noopener noreferrer"><strong>Fabio Calefato</strong></a>, <a href="https://www.linkedin.com/in/lanubile/" target="_blank" rel="nofollow noopener noreferrer"><strong>Filippo Lanubile</strong></a>, and <a href="https://www.linkedin.com/in/luigi-quaranta-007a6112a/" target="_blank" rel="nofollow noopener noreferrer"><strong>Luigi Quaranta</strong></a></li> <li>Part III in <a href="https://twitter.com/m_a_upson" target="_blank" rel="nofollow noopener noreferrer"><strong>Matt Upson's</strong>:</a> series <a href="https://medium.com/mantisnlp/mlops-for-conversational-ai-with-rasa-dvc-and-cml-part-iii-f56a29c428f3?source=rss----72ea48936cdc---4" target="_blank" rel="nofollow noopener noreferrer">MLOps for Conversational AI with Rasa, DVC, and CML (Part III)!</a></li> <li><a href="https://mail-redir.mention.com/api/url?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1cmwiOiJodHRwczpcL1wvZ2l0aHViLmNvbVwvcjBmMVwvZGF0YXNjaWVuY2VcL2NvbW1pdFwvNzMzMTU0YTdjYWJlOGY2MDRlMmMwYzQwOWI2NzRhY2QyODg3NWJhMCIsImFjY291bnRfaWQiOjEwMDMyNDIsImFsZXJ0X2lkIjoyNDM1MTgwLCJzb3VyY2VfaWQiOjY3LCJtZW50aW9uX2lkIjoxNDAzNzIzOTkwMzV9.AQcSYPdGzKBJemSgTDlyPcSeWL7dJTIlULRJaDqDVRg" target="_blank" rel="nofollow noopener noreferrer">Zen ML adds CML to its Awesome Data Science with Python list.</a> 😎</li> <li><a href="https://www.linkedin.com/in/alessandro-paticchio/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alessandro Paticchio</strong></a> (Casavo) <a href="https://medium.com/casavo/using-ai-to-automatically-estimate-the-status-of-a-fa%C3%A7ade-c84c2a90549e" target="_blank" rel="nofollow noopener noreferrer">Using AI to automatically estimate the status of 
a façade.</a> ⭐️</li> <li><a href="https://cmtech.live/2022/08/31/ci-cd-for-machine-learning-model-training-with-github-actions-by-zoumana-keita-aug-2022/" target="_blank" rel="nofollow noopener noreferrer">CI/CD for Machine Learning Model Training with GitHub Actions</a> by <a href="https://www.linkedin.com/in/zoumana-keita/" target="_blank" rel="nofollow noopener noreferrer"><strong>Zoumana Keita</strong></a></li> </ul> <h2 id="mlem" style="position:relative;">MLEM<a href="#mlem" aria-label="mlem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a href="https://www.instagram.com/tv/Cjnl8CuK2K0/" target="_blank" rel="nofollow noopener noreferrer">MLEM Instagram</a>. 
If you're on IG, follow <a href="https://www.instagram.com/the_ai_dot/" target="_blank" rel="nofollow noopener noreferrer">the_ai_dot</a> for AI & ML News, Tools & Libraries</li> </ul> <h2 id="️-tweet-love" style="position:relative;">❤️ Tweet Love<a href="#%EF%B8%8F-tweet-love" aria-label="️ tweet love permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>I had a really hard time choosing this month, but I was excited to see this Tweet from <a href="https://twitter.com/nsorros" target="_blank" rel="nofollow noopener noreferrer"><strong>Nick Sorros</strong></a> announcing the post from his colleague <a href="https://twitter.com/m_a_upson" target="_blank" rel="nofollow noopener noreferrer">Matt Upson</a>.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">A little belated but neverthless hugely interesting post by my co founders <a href="https://twitter.com/m_a_upson">@m_a_upson</a> in which he touches on some core tools we use at Mantis like <a href="https://twitter.com/DVCorg">@DVCorg</a>, <a href="https://twitter.com/Rasa_HQ">@Rasa_HQ</a> and continuous machine learning.<br><br>It comes with code 💻 so you can take some of what you will read and use 🚀 <a href="https://t.co/PHgLXtvckz">https://t.co/PHgLXtvckz</a></p>— Nick Sorros (@nsorros) <a href="https://twitter.com/nsorros/status/1571844138575843331">September 19, 2022</a></blockquote> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! 
Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/iterative-x-hacktoberfest-2022https://dvc.org/blog/iterative-x-hacktoberfest-2022Tue, 11 Oct 2022 00:00:00 GMT<p>Hacktoberfest is DigitalOcean’s annual event that encourages people to contribute to open source throughout October. Hacktoberfest is all about giving back to the community by contributing to open-source projects. 
The main point of Hacktoberfest is to encourage new open-source contributors, but whether you’re a seasoned contributor or looking for projects to contribute to for the first time, you’re welcome to participate!</p> <h2 id="what-is-iterative" style="position:relative;">What is Iterative<a href="#what-is-iterative" aria-label="what is iterative permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Iterative is a remote-first team on a mission to solve the complexity of managing datasets, ML infrastructure, and the ML model lifecycle. It was started in 2018 by a data scientist and an engineer to fill the gaps on the path from machine learning to production. 
Iterative is now growing quickly: adoption of the Iterative tools has increased significantly, and we have our contributors (more than 300 across code and docs) to thank for developing open source projects such as DVC, CML, and MLEM with us.</p> <p align="center"> <img src="https://media.giphy.com/media/wIVA0zh5pt0G5YtcAL/giphy.gif" alt="animated"> </p> <h2 id="quick-start" style="position:relative;">Quick Start<a href="#quick-start" aria-label="quick start permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li>Sign up for Hacktoberfest <a href="https://hacktoberfest.com/auth/" target="_blank" rel="nofollow noopener noreferrer">here</a></li> <li>Find all the Hacktoberfest issues <a href="https://github.com/search?o=desc&q=org%3Aiterative+label%3Ahacktoberfest&s=comments&state=open&type=Issues" target="_blank" rel="nofollow noopener noreferrer">here</a></li> <li>Read the contribution guidelines (<a href="https://dvc.org/doc/contributing/core" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, <a href="https://cml.dev/doc/contributing/core" target="_blank" rel="nofollow noopener noreferrer">CML</a>, <a href="https://mlem.ai/doc/contributing/core" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>)</li> <li>Join our <a href="https://discord.gg/5j3uvSnzXb" target="_blank" rel="nofollow noopener noreferrer">Hacktoberfest Discord channel</a> and ask any questions</li> <li>Create a pull request on the related GitHub repository</li> </ul> <h2 id="how-to-participate" style="position:relative;">How to Participate<a 
href="#how-to-participate" aria-label="how to participate permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The most exciting part about being involved in the open-source community is that no matter how small or big your contributions are, the community will welcome your efforts and collaborate with you positively, sharing feedback and expressing gratitude.</p> <p>If you haven’t started your Hacktoberfest challenge yet, it is just the right time; you have 4 weeks left to submit PRs and get your swag! Here are some important details:</p> <ul> <li>Hacktoberfest is open to everyone in the global community</li> <li>You can sign up anytime between October 1 and October 31. 
Make sure to sign up on the <a href="https://hacktoberfest.com/" target="_blank" rel="nofollow noopener noreferrer">official Hacktoberfest website</a> for your PRs to count</li> <li>Pull requests can be made in any <a href="https://github.com/topics/hacktoberfest" target="_blank" rel="nofollow noopener noreferrer">GitHub</a> project that’s participating in Hacktoberfest (look for the “Hacktoberfest” topic)</li> <li>Project maintainers must accept your pull/merge requests for them to count toward your total</li> <li>Have 4 pull/merge requests accepted between October 1 and October 31 to complete Hacktoberfest</li> </ul> <p>And the special addition from the Iterative team:</p> <ul> <li>Look through the list of <a href="https://github.com/search?o=desc&q=org%3Aiterative+label%3Ahacktoberfest&s=comments&state=open&type=Issues" target="_blank" rel="nofollow noopener noreferrer">Iterative Hacktoberfest tickets</a>.</li> <li>Make a PR to repositories and get our stickers.</li> <li>Close two issues for Iterative and get a special edition T-shirt.</li> </ul> <h3 id="important-rules" style="position:relative;">Important Rules<a href="#important-rules" aria-label="important rules permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>Your pull/merge requests must be within the bounds of Hacktoberfest</li> <li>Your pull/merge requests must not be spammy</li> <li>Your pull/merge requests must be in a repo tagged with the “Hacktoberfest” topic, or be labeled as “Hacktoberfest-accepted”</li> <li>Your pull/merge requests must not be labeled as “invalid”</li> <li>Avoid 
submitting low-quality pull/merge requests. More details can be found <a href="https://hacktoberfest.com/participation/#:~:text=AVOID%20SUBMITTING%20LOW%2DQUALITY%20PULL/MERGE%20REQUESTS." target="_blank" rel="nofollow noopener noreferrer">here</a></li> </ul> <p>At Iterative, our mission is to deliver the best developer experience for machine learning teams by creating an ecosystem of open, modular ML tools. Our tools are built for developers, by developers, and we need help from the global open-source community to deliver on this mission!</p> <p>For all of us who have a heart for open source — let’s discuss, contribute, learn, take the technologies forward and build something great together!</p> <p>Happy hacking!</p> <p align="center"> <img src="https://media.giphy.com/media/LcfBYS8BKhCvK/giphy.gif" alt="animated"> </p> <hr> <p>We are happy to hear from you <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too!</p>https://dvc.org/blog/dvc-hydra-integrationhttps://dvc.org/blog/dvc-hydra-integrationTue, 04 Oct 2022 00:00:00 GMT<p><a href="https://hydra.cc/" target="_blank" rel="nofollow noopener noreferrer">Hydra</a> has become one of the most popular tools for managing the configuration of research projects and complex applications, thanks to its ability to compose and override configuration both from the command line and from files.</p> <p>These features complement many of the capabilities DVC provides: <a href="https://dvc.org/doc/start/data-management/data-versioning" target="_blank" rel="nofollow noopener noreferrer">data versioning</a>, <a href="https://dvc.org/doc/start/data-management/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">data pipelines</a>, and <a href="https://dvc.org/doc/start/experiment-management/experiments" target="_blank" 
rel="nofollow noopener noreferrer">experiment management</a>.</p> <p>Therefore, we decided to tackle this by providing a deeper integration: using Hydra internals inside DVC and allowing users to benefit from the best of both tools.</p> <p>In this post, we are going to provide an overview of the benefits that users of both tools can get from the integration.</p> <h1 id="what-dvc-users-gain-from-the-integration" style="position:relative;">What DVC users gain from the integration<a href="#what-dvc-users-gain-from-the-integration" aria-label="what dvc users gain from the integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="use-hydra-composition-to-configure-dvc-experiments" style="position:relative;">Use Hydra composition to configure DVC experiments<a href="#use-hydra-composition-to-configure-dvc-experiments" aria-label="use hydra composition to configure dvc experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-10-04/deevee-band-1a6bd99d9764245325f931b7e987907a.mp4" type="video/mp4"> Your browser does not 
support the video tag. </video></p> <p>Previously, DVC didn’t provide a way of composing configuration from multiple sources, which can be very convenient in several use cases, like switching between different model architectures. The Hydra docs provide a great overview of <a href="https://hydra.cc/docs/patterns/configuring_experiments/" target="_blank" rel="nofollow noopener noreferrer">common patterns</a> where this composition is useful.</p> <p>DVC can now use Hydra Composition to configure entire DVC pipelines and run DVC experiments.</p> <p>You can learn more about this feature on the <a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">Hydra Composition</a> page of the user guide.</p> <h2 id="appending-and-removing-parameters-on-the-fly" style="position:relative;">Appending and removing parameters on the fly<a href="#appending-and-removing-parameters-on-the-fly" aria-label="appending and removing parameters on the fly permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Previously, DVC supported only limited functionality for modifying configuration using <code>exp run --set-param</code>.</p> <p><code>--set-param</code> can now be used with <a href="https://hydra.cc/docs/advanced/override_grammar/basic/" target="_blank" rel="nofollow noopener noreferrer">Hydra’s Basic Override syntax</a>, which supports new operations like <em>Appending</em> and <em>Removing</em> parameters for arbitrary parameter files.</p> <p>When Hydra’s composition is enabled, the same syntax can be used to override values in the <a 
href="https://hydra.cc/docs/tutorials/basic/your_first_app/config_groups/" target="_blank" rel="nofollow noopener noreferrer">Config Groups</a> and <a href="https://hydra.cc/docs/tutorials/basic/your_first_app/defaults/" target="_blank" rel="nofollow noopener noreferrer">Defaults list</a>.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Append new param</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'+trainer.gradient_clip_val=0.001'</span> </span><span class="token comment"># Remove existing param</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'~model.dropout'</span> </span><span class="token comment"># Target arbitrary files</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'train_config.json:+train.weight_decay=0.001'</span> </span><span class="token comment"># Modify the defaults list</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token string">'train/model=efficientnet'</span></span></code></pre></div> <h2 id="grid-search-of-parameters" style="position:relative;">Grid Search of parameters<a href="#grid-search-of-parameters" aria-label="grid search of parameters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 
9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>DVC <code>exp run</code> only supported <a href="https://dvc.org/doc/user-guide/experiment-management/running-experiments#the-experiments-queue" target="_blank" rel="nofollow noopener noreferrer">queuing</a> a single experiment at a time.</p> <p><code>exp run --set-param</code> can now use Hydra's <a href="https://hydra.cc/docs/advanced/override_grammar/extended/#choice-sweep" target="_blank" rel="nofollow noopener noreferrer">Choice</a> and <a href="https://hydra.cc/docs/advanced/override_grammar/extended/#range-sweep" target="_blank" rel="nofollow noopener noreferrer">Range</a> syntax for adding multiple experiments to the queue and performing a grid search:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'model.learning_rate=range(0.01, 0.5, 0.01)'</span> <span class="token parameter variable">--queue</span> </span>Queueing with "{'params.yaml': ['model.learning_rate=0.01']}". Queued experiment '84e89be' for future execution. Queueing with "{'params.yaml': ['model.learning_rate=0.02']}". Queued experiment 'd7708fa' for future execution. Queueing with "{'params.yaml': ['model.learning_rate=0.03']}". Queued experiment '5494d5c' for future execution. Queueing with "{'params.yaml': ['model.learning_rate=0.04']}". Queued experiment '2e16c1f' for future execution. Queueing with "{'params.yaml': ['model.learning_rate=0.05']}". Queued experiment '7c7a615' for future execution. 
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc queue start</span></span></code></pre></div> <h1 id="what-hydra-users-gain-from-the-integration" style="position:relative;">What Hydra users gain from the integration<a href="#what-hydra-users-gain-from-the-integration" aria-label="what hydra users gain from the integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="git-based-versioning-and-caching" style="position:relative;">Git-based versioning and caching<a href="#git-based-versioning-and-caching" aria-label="git based versioning and caching permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Hydra relies on <a href="https://hydra.cc/docs/configure_hydra/workdir/" target="_blank" rel="nofollow noopener noreferrer">folder-based versioning</a> for managing multiple runs.</p> <p>By using the DVC and Hydra integration, you can version the runs using <a href="https://dvc.org/doc/user-guide/experiment-management" target="_blank" rel="nofollow noopener noreferrer">DVC experiments</a>, enabling a more <a href="https://dvc.org/doc/user-guide/experiment-management/persisting-experiments" 
target="_blank" rel="nofollow noopener noreferrer">git-friendly</a> workflow and adding <a href="https://dvc.org/doc/user-guide/experiment-management#run-cache-automatic-log-of-stage-runs" target="_blank" rel="nofollow noopener noreferrer">caching</a> capabilities so runs won’t be unnecessarily recomputed.</p> <h2 id="multi-step-pipelines-and-language-agnostic" style="position:relative;">Multi-step pipelines and Language Agnostic<a href="#multi-step-pipelines-and-language-agnostic" aria-label="multi step pipelines and language agnostic permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Hydra's scope is limited to a single <strong>Python script</strong> wrapped with the <code>@hydra.main</code> decorator.</p> <p>By using the <a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">DVC and Hydra integration</a>, you can use Hydra to configure entire <a href="https://dvc.org/doc/start/data-management/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">DVC pipelines</a>, which can be composed of <strong>multiple</strong> <strong>stages</strong> running <strong>arbitrary</strong> <strong>commands.</strong></p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">featurize</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python 
src/featurization.py data/prepared data/features <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data/prepared <span class="token punctuation">-</span> src/featurization.py <span class="token key atrule">params</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> featurize.max_features <span class="token punctuation">-</span> featurize.ngrams <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data/features <span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/train.py data/features model.pkl <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data/features <span class="token punctuation">-</span> src/train.py <span class="token key atrule">params</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> train.min_split <span class="token punctuation">-</span> train.n_est <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> model.pkl</code></pre></div> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token string">'featurize.max_features=200'</span> <span class="token parameter variable">-S</span> <span class="token string">'train.n_est=100'</span> </span>Running stage 'featurize': > python src/featurization.py data/prepared data/features Running stage 'train': > python src/train.py data/features model.pkl</code></pre></div> <h1 id="conclusion" 
style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Starting with DVC <code>2.25.0</code>, you can use the features described in this post to efficiently combine Hydra and DVC in your projects.</p> <p>To get a deeper understanding of all the parts involved, you can read the <a href="https://dvc.org/doc/user-guide/experiment-management/hydra-composition" target="_blank" rel="nofollow noopener noreferrer">Hydra Composition</a> page of the DVC user guide.</p>https://dvc.org/blog/september-22-heartbeathttps://dvc.org/blog/september-22-heartbeatMon, 19 Sep 2022 00:00:00 GMT<details> <p>This month’s image inspiration is community member <a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sami Jawhar</strong></a>. Sami has contributed to DVC in the past and, most recently, worked with the DVC and CML teams on extending our remote experiment features to run experiments in parallel, which you can check out <a href="https://github.com/iterative/dvc/commit/c7d63e8c59819592d2a749ab721fe5c85379fece" target="_blank" rel="nofollow noopener noreferrer">here</a> and <a href="https://github.com/iterative/terraform-provider-iterative/compare/master...sjawhar:terraform-provider-iterative:feature/nfs-volume" target="_blank" rel="nofollow noopener noreferrer">here</a>. 
Look out for him speaking at a Meetup soon on this topic!</p> <p>Last year Sami presented at one of our <a href="https://www.youtube.com/watch?v=DxZdWq3Weng" target="_blank" rel="nofollow noopener noreferrer">Office Hours meetups</a> on “What is an experiment?” More specifically he asked, at what level of granularity do you experiment and when do you share with your team? He shared great ideas, tips, and code in the session and spurred a great discussion with other community members. We look forward to the next Meetup!</p> <summary>✨Image Inspo✨</summary> </details> <details> Our Community has grown and so has the monthly Heartbeat! To help you better navigate to the content you desire, use the following ToC: <h1 id="table-of-contents" style="position:relative;">Table of contents<a href="#table-of-contents" aria-label="table of contents permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <ol> <li><a href="#from-greater-aiml-community">From Greater AI/ML Community</a> <ol> <li><a href="#meta-is-building-an-ai-to-fact-check-wikipediaall-65-million-articles">Meta Is Building an AI to Fact-Check Wikipedia</a></li> <li><a href="#european-ai-act">EU AI Act</a></li> <li><a href="#pulse-check">💗Pulse Check</a></li> </ol> </li> <li><a href="#iterative-community-news">Iterative Community News</a> <ol> <li><a href="#francesco-calcavecchia---we-refused-to-use-a-hammer-on-a-screw-story-of-gto-based-model-registry">Story of GTO-based model registry</a></li> <li><a href="#mlops-course-at-the-technical-university-of-denmark-includes-dvc-and-cml">MLOps Course at University 
of Denmark</a></li> <li><a href="#goku-mohandas---made-with-ml-mlops-interactive-course">Made With ML MLOps Interactive Course</a></li> <li><a href="#adri%C3%A0-romero---youtube-review-of-dvc">Lakera Review of DVC (video)</a></li> <li><a href="#sydney-firmin---reproducibility-replicability-and-data-science">Reproducibility, Replicability, and Data Science</a></li> <li><a href="#iterative-xkcd-lore">Iterative xkcd lore</a></li> </ol> </li> <li><a href="#company-news">Company News</a> <ol> <li><a href="#mlem-mlem-mlem-this-dog-food-is-good">We are eating our own dog food</a></li> <li><a href="#alex-kim-oreilly-mlops-course">New O'Reilly Course with Alex Kim</a></li> <li><a href="#latam-ai">LATAM AI</a></li> <li><a href="#new-hires">New Hires</a></li> <li><a href="#open-positions">Open Positions</a></li> <li><a href="#new-blog-posts">New Blog posts</a></li> <li><a href="#upcoming-conferences">Upcoming Conferences</a></li> </ol> </li> <li><a href="#tweet-love">Tweet Love</a></li> </ol> <summary>Table of Contents</summary> </details> <p>As the summer fades and we get revved up to finish off the year, we start the September Heartbeat with some juicy food-for-thought AI topics.</p> <p><img src="https://media.giphy.com/media/kPtv3UIPrv36cjxqLs/giphy.gif" alt="Will Ferrell Lol GIF by NBA"></p> <h2 id="from-greater-aiml-community" style="position:relative;">From Greater AI/ML Community<a href="#from-greater-aiml-community" aria-label="from greater aiml community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 
id="meta-is-building-an-ai-to-fact-check-wikipediaall-65-million-articles" style="position:relative;">Meta Is Building an AI to Fact-Check Wikipedia—All 6.5 Million Articles<a href="#meta-is-building-an-ai-to-fact-check-wikipediaall-65-million-articles" aria-label="meta is building an ai to fact check wikipediaall 65 million articles permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c2295358e28c8014e45ea3b24ca41ee9/ab158/wikipedia.png" alt="Meta Fact-Checking Wikipedia" title="Meta Fact-checking Wikipedia Ai" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="http://twitter.com/vanessabramirez" target="_blank" rel="nofollow noopener noreferrer"><strong>Vanessa Bates Ramirez</strong></a> writes <a href="https://singularityhub.com/2022/08/26/meta-is-building-an-ai-to-fact-check-wikipedia-all-6-5-million-articles/" target="_blank" rel="nofollow noopener noreferrer">an article</a> in <a href="https://singularityhub.com" target="_blank" rel="nofollow noopener noreferrer">Singularity Hub</a> about Meta's plans to fact-check Wikipedia. 
Under the premise of making Wikipedia more accurate, <a href="https://about.facebook.com/?utm_source=meta.com&utm_medium=redirect" target="_blank" rel="nofollow noopener noreferrer">Meta</a>, in conjunction with <a href="https://www.amazon.science/tag/alexa" target="_blank" rel="nofollow noopener noreferrer">Amazon Alexa.AI</a> and <a href="https://openreview.net/pdf?id=qfTqRtkDbWZ" target="_blank" rel="nofollow noopener noreferrer">some university contributors</a>, is building an AI system trained on 4 million Wikipedia citations. The system architecture, made up of retrieval and verification engines, cross-references not only content but also specific figures to verify accuracy.</p> <p>They’ve built an index of web pages that are chunked into passages, then provide an accurate representation of each passage to train the model. Their aim is to capture meaning more accurately, as opposed to word patterns. From <a href="https://twitter.com/Fabio_Petroni" target="_blank" rel="nofollow noopener noreferrer"><strong>Fabio Petroni</strong></a>, Meta’s Fundamental AI Research tech lead manager:</p> <blockquote> <p>[This index] is not representing word-by-word the passage, but the meaning of the passage. That means that two chunks of text with similar meaning will be represented in a very close position in the resulting n-dimensional space where all these passages are stored.</p> </blockquote> <p>They hope to ultimately be able to suggest accurate sources and create a grading system for accuracy. <a href="https://verifier.sideeditor.com/" target="_blank" rel="nofollow noopener noreferrer">You can find a demo of the project, named Side, here</a> to look at samples and go deeper into the research. 
They are looking for people to give feedback on the quality of the system.</p> <p>Vanessa brings up some great questions regarding this:</p> <blockquote> <p>If you imagine a not-too-distant future where everything you read on Wikipedia is accurate and reliable, wouldn’t that make doing any sort of research a bit too easy? There’s something valuable about checking and comparing various sources ourselves, is there not? It was a big leap to go from paging through heavy books to typing a few words into a search engine and hitting “Enter”; do we really want Wikipedia to move from a research jumping-off point to a gets-the-last-word source?</p> </blockquote> <p>To these I’d add: what is Meta’s and Amazon Alexa’s monetary motivation to do this (because there always is one)? And given past ethical infractions on Meta’s part (<a href="https://link.springer.com/article/10.1007/s43681-021-00068-x" target="_blank" rel="nofollow noopener noreferrer">1</a>, <a href="https://www.abc.net.au/triplej/programs/hack/facebook-whistleblower-says-instagram-content-hurts-teens/13573020" target="_blank" rel="nofollow noopener noreferrer">2</a>, <a href="https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election" target="_blank" rel="nofollow noopener noreferrer">3</a>, <a href="https://www.buzzfeednews.com/article/craigsilverman/viral-fake-election-news-outperformed-real-news-on-facebook" target="_blank" rel="nofollow noopener noreferrer">4</a>, and <a href="https://www.theatlantic.com/technology/archive/2014/06/everything-we-know-about-facebooks-secret-mood-manipulation-experiment/373648/" target="_blank" rel="nofollow noopener noreferrer">5</a>), should we applaud this? 
Or is this collaboration with universities a step in the right direction?</p> <h3 id="european-ai-act" style="position:relative;">European AI Act<a href="#european-ai-act" aria-label="european ai act permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d9923667ce38b6b7e88c76dda707f8d2/bbe0c/eu.jpg" alt="EU AI Act" title="EU AI Act" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="https://twitter.com/Kyle_L_Wiggers" target="_blank" rel="nofollow noopener noreferrer"><strong>Kyle Wiggers</strong></a> reports on the EU's AI Act and its potential ill effects on open source efforts in <a href="https://techcrunch.com/2022/09/06/the-eus-ai-act-could-have-a-chilling-effect-on-open-source-efforts-experts-warn/" target="_blank" rel="nofollow noopener noreferrer">this piece</a> in <a href="https://techcrunch.com" target="_blank" rel="nofollow noopener noreferrer">TechCrunch</a>. The proposed new rules would require that open source developers adhere to guidelines across a spectrum of categories including risk management, data governance, technical documentation and transparency, standards and accuracy, and cybersecurity. 
Not a negligible list.</p> <p>The article covers critiques of the Act from <a href="https://www.brookings.edu/experts/alex-engler/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Engler</strong></a> of think tank <a href="https://brookings.edu" target="_blank" rel="nofollow noopener noreferrer">Brookings</a>, as laid out in <a href="https://www.brookings.edu/blog/techtank/2022/08/24/the-eus-attempt-to-regulate-open-source-ai-is-counterproductive/" target="_blank" rel="nofollow noopener noreferrer">this piece.</a> <a href="https://twitter.com/etzioni" target="_blank" rel="nofollow noopener noreferrer"><strong>Oren Etzioni</strong></a>, the founding CEO of the <a href="https://allenai.org/" target="_blank" rel="nofollow noopener noreferrer">Allen Institute for AI</a>, adds that such regulation could create an undue burden that only large tech companies could meet:</p> <blockquote> <p>“Open source developers should not be subject to the same burden as those developing commercial software. It should always be the case that free software can be provided ‘as is’ — consider the case of a single student developing an AI capability; they cannot afford to comply with EU regulations and may be forced not to distribute their software, thereby having a chilling effect on academic progress and on reproducibility of scientific results.”</p> </blockquote> <p>The article discusses some proponents of the Act, as well as alternative views on the granularity of regulations (product vs. category, or downstream responsibility). Finally, it ends with comments from Hugging Face CEO <a href="https://twitter.com/ClementDelangue" target="_blank" rel="nofollow noopener noreferrer"><strong>Clément Delangue</strong></a> and his colleagues on the vagueness and the problems that can arise out of this lack of clarity, including stifling competition and innovation. 
They also point out the growing Responsible AI initiatives such as AI licensing and model cards outlining the intended use of such open source technology as positives that are community-born.</p> <p>So does regulation stifle technology or provide guard rails?</p> <p>My colleague <a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> would like to point out that similar concerns were raised when the EU introduced the GDPR in 2016, which has turned out to be of major importance to people's rights to privacy — in the EU and worldwide.</p> <p>To what degree should AI technology be regulated? Where do you draw lines? It’s quite clear that it moves faster than lawmakers can keep up with and the potential for harm is well known at this point. We could say, as I believe, that reflection on the consequences should be baked into the building process. However, the reality in practice is that —despite best intentions— the overarching push for better and faster often results in negative consequences that are only discovered after the fact.</p> <p>How do we incentivize reflecting on consequences in our processes? Would regulation force this? Make development slower, but necessarily force the social good work that must be done in the development of AI tech?</p> <p>What other industries have similar dilemmas and how do they handle it? 
The Hippocratic Oath has served medicine well for thousands of years.<br> <a href="https://ojs.aaai.org/index.php/aimagazine/article/view/15090" target="_blank" rel="nofollow noopener noreferrer">Do We Need a Hippocratic Oath for Artificial Intelligence Scientists?</a></p> <h3 id="pulse-check" style="position:relative;">Pulse Check<a href="#pulse-check" aria-label="pulse check permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We would love to hear (read) your thoughts on this! We are starting a “Pulse check” topic from the Heartbeat each month up for discussion in our Discord server in the General channel. 
<a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Come join the discussion!</a></p> <p><img src="https://media.giphy.com/media/W5JywCYOCSP8VMiVZg/giphy.gif" alt="Heartbeat GIF"></p> <h2 id="iterative-community-news" style="position:relative;">Iterative Community News<a href="#iterative-community-news" aria-label="iterative community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="francesco-calcavecchia---we-refused-to-use-a-hammer-on-a-screw-story-of-gto-based-model-registry" style="position:relative;"><strong>Francesco Calcavecchia</strong> - We refused to use a hammer on a screw: Story of GTO-based model registry<a href="#francesco-calcavecchia---we-refused-to-use-a-hammer-on-a-screw-story-of-gto-based-model-registry" aria-label="francesco calcavecchia we refused to use a hammer on a screw story of gto based model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/francescocalcavecchia/" target="_blank" rel="nofollow noopener noreferrer"><strong>Francesco Calcavecchia</strong></a> <a 
href="https://medium.com/@francesco.calcavecchia/we-refused-to-use-a-hammer-on-a-screw-story-of-a-gto-based-model-registry-c540ac5d129f" target="_blank" rel="nofollow noopener noreferrer">wrote a piece</a> on <a href="https://medium.com" target="_blank" rel="nofollow noopener noreferrer">Medium</a> about building a custom model registry with <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a>.</p> <p>He gives the main reasons for needing a model registry as:</p> <ol> <li>When you need model versioning</li> <li>When you need to promote or assign models to different stages</li> <li>When you need to establish production model governance</li> </ol> <p>Additionally, he finds it necessary to register the data analysis and model evaluation outputs in an artifact registry, and he used GTO and DVC to accomplish this. He goes into more detail about why he chose GTO over MLflow, essentially appreciating our UNIX philosophy, which favors agility over prescriptive methods that hamper your design choices. He notes:</p> <blockquote> <p><strong>It is hard to think of something simpler than this. And simplicity is beauty</strong> ❤️</p> </blockquote> <p>He then discusses some things he found missing for his needs, such as using it in a production pipeline as opposed to committing models by hand. He discusses working on solutions to build the artifact registry, introduce new commands, and streamline the process for the <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> remote storage secret requirements. Please join him in his contributions. We love to see where this is going! 
🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/07b6eea95e4eb33178c818d1a1e0578a/03346/artifact-gto.jpg" alt="DVC GTO Artifact Registry schematic" title="DVC GTO Artifact Registry schematic" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Francesco Calcavecchia's schematic for a proposed artifact registry with DVC and GTO (<a href="https://medium.com/@francesco.calcavecchia/we-refused-to-use-a-hammer-on-a-screw-story-of-a-gto-based-model-registry-c540ac5d129f" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="mlops-course-at-the-technical-university-of-denmark-includes-dvc-and-cml" style="position:relative;">MLOps Course at the Technical University of Denmark includes DVC and CML<a href="#mlops-course-at-the-technical-university-of-denmark-includes-dvc-and-cml" aria-label="mlops course at the technical university of denmark includes dvc and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/446b7dcde514e3a7faeec88a200ee44e/03346/dtu-mlops.jpg" alt="DTU MLOps Course Memes" title="DTU MLOps Course Meme" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> The <a href="https://www.dtu.dk/english" 
target="_blank" rel="nofollow noopener noreferrer">Technical University of Denmark (DTU)</a> has included DVC and CML in its MLOps Course at the University. The lectures, slides, exercises, and code can be found in <a href="https://github.com/SkafteNicki/dtu_mlops" target="_blank" rel="nofollow noopener noreferrer">this repo</a> from <a href="https://github.com/SkafteNicki" target="_blank" rel="nofollow noopener noreferrer"><strong>Nicki Skafte Detlefsen</strong></a>, Postdoc in the section of Cognitive Systems at the University with a focus on generative models and geometrical deep learning. There are 10 sections covering:</p> <ol> <li>Getting started</li> <li>Organization and version control (find Git and DVC here)</li> <li>Reproducibility</li> <li>Debugging and logging</li> <li>Continuous X (find CML here)</li> <li>The Cloud</li> <li>Scalable applications</li> <li>Deployment</li> <li>Monitoring</li> <li>Extra Resources</li> </ol> <p>The materials are great and even include some funny memes. Isn't an open-source model amazing for learning? Cheers to DTU for including our tools and the open source sharing of these learning materials with the world!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 640px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/61262120d4bbd20269bd3e99788d3d50/bbe0c/dtu-bad-code.jpg" alt="DTU bad code comic" title="DTU bad code comic" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Good code review vs. 
Bad code review (<a href="https://github.com/SkafteNicki/dtu_mlops/blob/main/s2_organisation_and_version_control/S2.md" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="goku-mohandas---made-with-ml-mlops-interactive-course" style="position:relative;"><strong>Goku Mohandas</strong> - Made With ML MLOps Interactive Course<a href="#goku-mohandas---made-with-ml-mlops-interactive-course" aria-label="goku mohandas made with ml mlops interactive course permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You likely already know of <a href="https://github.com/GokuMohandas" target="_blank" rel="nofollow noopener noreferrer"><strong>Goku Mohandas'</strong></a> wildly popular free course <a href="https://madewithml.com/#mlops" target="_blank" rel="nofollow noopener noreferrer">Made with ML</a>, which includes DVC. Knowing that it can be challenging to learn everything on your own, he is starting an interactive class beginning on October 1st. 
The deadline for application is September 25th.<br> <a href="https://madewithml.com/#interactive-course" target="_blank" rel="nofollow noopener noreferrer">For more info find the details here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/149517f06f19a9dd9fff3925347b556a/39600/made-with-ml.png" alt="Goku Mohandas - Made with ML MLOps" title="Goku Mohandas - Made with ML MLOps" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Goku Mohandas' Made with ML Interactive Course (<a href="https://madewithml.com/#mlops" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="adrià-romero---youtube-review-of-dvc" style="position:relative;"><strong>Adrià Romero</strong> - YouTube review of DVC<a href="#adri%C3%A0-romero---youtube-review-of-dvc" aria-label="adrià romero youtube review of dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/adriaromero/" target="_blank" rel="nofollow noopener noreferrer"><strong>Adrià Romero</strong></a>, Computer Vision Developer at <a href="https://www.lakera.ai/" target="_blank" rel="nofollow noopener noreferrer">Lakera</a>, has a regular tool review on tools that can make computer vision easier, and recently reviewed DVC. He does a demo of DVC pushing up to a Google Drive remote and goes over how to share DVC-tracked data. 
He then covers the data pipelines functionality that can be used for CI/CD pipelines and shows the benefits of tracking the versions of everything including data, models, pipelines, parameters, and experiments. Finally, he mentioned that our documentation is super clear and useful, which makes us very happy. 🦉Check out the review below.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DXlxr4sEnc0?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="sydney-firmin---reproducibility-replicability-and-data-science" style="position:relative;"><strong>Sydney Firmin</strong> - Reproducibility, Replicability, and Data Science<a href="#sydney-firmin---reproducibility-replicability-and-data-science" aria-label="sydney firmin reproducibility replicability and data science permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" 
src="https://dvc.org/static/de2134806780269b20cad03f670f6247/8a54c/the_difference.png" alt="Sydney Firmin - Reproducibility, Replicability, and Data Science" title="Sydney Firmin - Reproducibility, Replicability, and Data Science" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p><a href="https://www.linkedin.com/in/sydney-f-4369a65b/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sydney Firmin</strong></a> writes <a href="https://www.kdnuggets.com/2019/11/reproducibility-replicability-data-science.html" target="_blank" rel="nofollow noopener noreferrer">a wonderful piece</a> in KDnuggets outlining the replicability crisis and the importance of reproducibility in science in general and data science in particular. She highlights the growing awareness of irreproducible research, as technology helps all research circulate more widely. She encourages standardizing a paradigm of reproducibility in data science work to promote efficiency and accuracy, and to help your future self and colleagues check work and reduce bugs.</p> <p>Of course, she recommends DVC as a possible tool to help with this and notes,</p> <blockquote> <p>fun fact, this is my second attempt at writing this post after my computer was <a href="https://en.wikipedia.org/wiki/Brick_(electronics)" target="_blank" rel="nofollow noopener noreferrer">bricked</a> last week. I am now compulsively saving all of my work in <a href="https://www.vox.com/2015/4/30/11562024/too-embarrassed-to-ask-what-is-the-cloud-and-how-does-it-work" target="_blank" rel="nofollow noopener noreferrer">the cloud</a>.</p> </blockquote> <p>Haven’t we all been there? 🙋🏻‍♀️ She goes on to describe other contributors to irreproducible results, including p-hacking, and discusses methods beyond tooling that can help, such as preventing overfitting, using a sufficiently large dataset, and team review. 
All this and some fun xkcd comics can be found in the post, including <a href="https://xkcd.com/242" target="_blank" rel="nofollow noopener noreferrer">this one shown above</a>!</p> <details> <p>Speaking of xkcd comics, <a href="https://github.com/casperdcl" target="_blank" rel="nofollow noopener noreferrer"><strong>Casper da Costa Luis</strong></a>, CML Product Manager, loves xkcd and regularly regales us with the comics in our internal Slack. He is also an expert at TL;DRing (yes, I just made that a verb). Part of his process in this excellence is to “<a href="https://tldr.cdcl.ml" target="_blank" rel="nofollow noopener noreferrer">suppress my latent desire to add a relevant xkcd comic</a>.” As you can see, they do not appear every day. Self-discipline is a good thing.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fc1c3a4e4b20260e7d3f261f11d650c0/39600/casper-xkcd.png" alt="Casper da Costa Luis and xkcd comics" title="Casper da Costa Luis and xkcd comics" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Casper da Costa Luis' propensity for Slack slinging xkcd comics</em></p> <summary id="iterative-xkcd-lore">😄 Iterative xkcd Lore</summary> </details> <h2 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><img 
src="https://media.giphy.com/media/ji6BdEco3I29DTXddx/giphy-downsized-large.gif" alt="Happy Dog Food GIF by Diamond Pet Foods"></p> <h3 id="mlem-mlem-mlem-this-dog-food-is-good" style="position:relative;">MLEM, MLEM, MLEM, this dog food is good!<a href="#mlem-mlem-mlem-this-dog-food-is-good" aria-label="mlem mlem mlem this dog food is good permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Over the summer, you may have noticed that our blog has moved from the <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> website to the <a href="https://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative</a> website. Because we now have many more tools than DVC, we wanted to make a blog home for them all. In this transition, we have also changed our internal blog writing process from being just Git-dependent to Git- and DVC-dependent, such that the writing is in Git, but the images are versioned with DVC and stored in a remote. 🤗</p> <p>This admittedly may be like bringing a <a href="https://arclightcnc.com/product/cnc-router-kit" target="_blank" rel="nofollow noopener noreferrer">CNC router</a> to a steak dinner (I feel like there should be a MythBusters episode on this). <strong>But</strong> it will help both the DevRel team and the Websites team become intimately familiar with what our users feel when using our tools and potentially drive more feature improvements for you. 
In other words, we ❤️ you and we're really serious about making our tools better for you so you don't have to build them yourselves!</p> <p><img src="https://media.giphy.com/media/wdA6Ql7ku32JZKXBFV/giphy.gif" alt="Ken Jeong Masked Singer GIF by FOX TV"></p> <h3 id="alex-kim-oreilly-mlops-course" style="position:relative;"><strong>Alex Kim</strong> O'Reilly MLOps Course<a href="#alex-kim-oreilly-mlops-course" aria-label="alex kim oreilly mlops course permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ed1f0f9c5183f7978b8ff0cec495e436/39600/alex-oreilly.png" alt="Open-source MLOps in 4 weeks with Alex Kim" title="Open Source MLOps in 4 weeks" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> is working with <a href="https://www.oreilly.com/" target="_blank" rel="nofollow noopener noreferrer">O'Reilly</a> on a course entitled <em>Open-source MLOps in 4 weeks</em>. 
Here is an outline of what you will be learning in the course which starts on November 8th and again on January 10th:</p> <ul> <li>Week 1: Kick-starting an ML project</li> <li>Week 2: ML pipelines and reproducibility</li> <li>Week 3: Serving ML models as web API services</li> <li>Week 4: CI/CD and monitoring for ML projects</li> </ul> <p><a href="https://learning.oreilly.com/live-events/open-source-mlops-in-4-weeks/0636920080215/0636920080214/" target="_blank" rel="nofollow noopener noreferrer">Head here to sign up for the course</a></p> <h3 id="latam-ai" style="position:relative;">LATAM AI<a href="#latam-ai" aria-label="latam ai permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> and our lead docs writer, <a href="https://twitter.com/JorgeOrpinel" target="_blank" rel="nofollow noopener noreferrer"><strong>Jorge Orpinel Perez</strong></a>, got to experience <a href="https://www.latam-ai.com/" target="_blank" rel="nofollow noopener noreferrer">LATAM AI</a> this year. Gema gave the talk <em>Reproducibility and version control are important: Follow-up experiments with the DVC extension for VS Code</em>. Both Gema and Jorge enjoyed the conference and meeting lots of people. Below you can see Gema with the winners of our DeeVee's Ramen Run Game. In the game, players have to roam DeeVee city answering questions to win Ramen and the highest place on the leaderboard. Get yourself to one of the conferences we are attending to play! 
See winners Miguel Moran Flores, Efren Bautista Linares and Rodolfo Ferro below.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c4a5039960064b937e6b3c52018a78be/03346/latam-ai-winners.jpg" alt="Efren Bautista Linares, Miguel Moran Flores, Rodolfo Ferro with Gema Parreño" title="Efren Bautista Linares, Miguel Moran Flores, Rodolfo Ferro with Gema Parreño" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Winners of DeeVee's Ramen Run game with Gema, Left to Right: Efren Bautista Linares, Miguel Moran Flores, Gema Parreño Piqueras, and Rodolfo Ferro</em></p> <h3 id="new-hires" style="position:relative;">New Hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/ronan-lamy-84133612/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ronan Lamy</strong></a> joins the DVC team from Bristol, UK. He has a Ph.D. in physics and had been working as an open-source contractor as core dev of PyPy and HPy before joining Iterative. When he's not working, Ronan enjoys exploring the many fine restaurants and great local beers of Bristol. Originally from France, Ronan recently shared with me that his friends and family back home don't believe that the food can be so good in Bristol, but he insists it is. Add it to your bucket list! 
When in Bristol, Ronan has recommendations for you!</p> <p><a href="https://github.com/nimdraugsael" target="_blank" rel="nofollow noopener noreferrer"><strong>Aleksei Shaikhaleev</strong></a> joins the Studio team as a backend developer. Originally from Russia, Aleksei has called Phuket, Thailand his home base for the last 10 years. When he's not working, he's really into surfing, skateboarding, motorcycles, and other fun activities like these. Aleksei also has a heart for rescuing cats, having adopted and caring for five stray cats at home!</p> <p><a href="https://www.linkedin.com/in/david-tulga-60b29410/" target="_blank" rel="nofollow noopener noreferrer"><strong>David Tulga</strong></a> is our latest hire, and joins the LDB team from California as a Senior Software Engineer. He previously worked at Asimov and Freenome. When not working David enjoys a variety of outdoor activities such as Biking, Hiking, Kayaking, Sailing, and Astronomy.</p> <p>David's arrival marks the 4th David on the team, putting the name David in a three-way tie with versions of Daniel and Alexander! Indeed over 20% of our workforce is named David, Daniel, or Alexander. 😅</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the open positions. 
Please share with anyone looking to have a lot of fun building the next generation of machine learning to production tools! 🚀 But don't apply if your name is David, Daniel, or Alexander. Unless you're willing to be nick-named, of course! It's getting confusing around here. 😂</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/14a8b26dd92c5a19428d6a7bef2078f0/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative is Hiring (<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="new-blog-posts" style="position:relative;">✍🏼 New Blog posts<a href="#new-blog-posts" aria-label="new blog posts permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> created a tutorial for using CML with <a href="https://bitbucket.org/" target="_blank" rel="nofollow noopener noreferrer">Bitbucket</a>, which CML now supports. 
Be sure to read it if Bitbucket is your Git provider of choice!</li> <li><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras'</strong></a> <a href="https://dvc.org/blog/august-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">August Community Gems</a> is full of great questions from the Community from our <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord server</a>.</li> </ul> <h2 id="upcoming-conferences" style="position:relative;">Upcoming Conferences<a href="#upcoming-conferences" aria-label="upcoming conferences permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Conferences we will be attending through the end of the year:</p> <ul> <li> <p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> and <a href="https://github.com/mike0sv" target="_blank" rel="nofollow noopener noreferrer"><strong>Mike Sveshnikov</strong></a> will be giving a talk and workshop on our GitOps approach to a Model registry at <a href="https://twimlai.com/conf/twimlcon/2022/" target="_blank" rel="nofollow noopener noreferrer">TWIML Con</a> on October 4-7 (On-line)</p> </li> <li> <p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> will speak at <a href="https://odsc.com/california/" target="_blank" rel="nofollow noopener noreferrer">ODSC West</a> in San Francisco on November 1-3 on the same 
topic</p> </li> <li> <p><a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> will be speaking at <a href="https://deeplearningworld.de/" target="_blank" rel="nofollow noopener noreferrer">Deep Learning World</a> - Berlin, October 5-6 with the talk <em>Becoming a Pokémon Master with DVC: Experiment Pipelines for Deep Learning Projects</em></p> </li> <li> <p><a href="https://cdcl.ml/" target="_blank" rel="nofollow noopener noreferrer"><strong>Casper da Costa Luis</strong></a> will be giving the talk <em>Painless cloud orchestration without leaving your IDE</em> at <a href="https://www.re-work.co/events/mlops-summit-2022" target="_blank" rel="nofollow noopener noreferrer">MLOps Summit - Re-work</a> - London, November 8-9</p> </li> <li> <p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> will be speaking at <a href="https://www.githubuniverse.com/" target="_blank" rel="nofollow noopener noreferrer">GitHub Universe</a> on November 9-10 with the talk <em>Connecting Machine Learning with Git: ML experiment tracking with Codespaces</em>!</p> </li> <li> <p>Finally, we will be participating in <a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a> - November 29-30 in Toronto, talks TBD</p> <h2 id="tweet-love" style="position:relative;">❤️ Tweet Love<a href="#tweet-love" aria-label="tweet love permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> 
</a></h2> <p>We loved finding DVC and CML used for benchmarking and reporting at <a href="https://huggingface.co" target="_blank" rel="nofollow noopener noreferrer">Hugging Face</a> thanks to the tip-off from <a href="https://twitter.com/osanseviero" target="_blank" rel="nofollow noopener noreferrer">Omar Sanseviero</a>! Look out for more projects involving Hugging Face and our tools coming soon!</p> </li> </ul> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr"><a href="https://twitter.com/huggingface">@huggingface</a> datasets uses <a href="https://twitter.com/DVCorg">@DVCorg</a> & CML for benchmark and reporting 🥰 . More about the .yaml structure here --> <a href="https://t.co/NY5FMzjNuR">https://t.co/NY5FMzjNuR</a> Glad to discover common opensourceness <a href="https://twitter.com/osanseviero">@osanseviero</a> ! 🤗🥹🦉</p>— Gema Parreño (@SoyGema) <a href="https://twitter.com/SoyGema/status/1567824457296642048">September 8, 2022</a></blockquote> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/bitbucket-cml-runnershttps://dvc.org/blog/bitbucket-cml-runnersTue, 06 Sep 2022 00:00:00 GMT<p>A while ago, we learned about <a href="https://dvc.org/blog/CML-runners-saving-models-1" target="_blank" rel="nofollow noopener noreferrer">training models in the cloud and saving them in Git</a>. We did so using <a href="https://cml.dev/doc/start/github" target="_blank" rel="nofollow noopener noreferrer">CML and GitHub Actions</a>. GitLab is <a href="https://cml.dev/doc/start/gitlab" target="_blank" rel="nofollow noopener noreferrer">also supported</a>, and a <a href="https://github.com/iterative/cml/releases/tag/v0.16.0" target="_blank" rel="nofollow noopener noreferrer">recent CML release</a> incorporated support for self-hosted runners in Bitbucket Pipelines: a good excuse to revisit this topic and show how CML works in conjunction with Bitbucket's CI/CD.</p> <p>Using CML to provision cloud instances for our model (re)training has a number of benefits:</p> <ul> <li>Bring Your Own Cloud: a single CML command connects your existing cloud to your existing CI/CD</li> <li>Cloud abstraction: CML handles the interaction with our cloud provider, removing the need to configure resources directly. 
We could even switch cloud providers by changing a single parameter</li> <li>Auto-termination: CML automatically terminates instances once they are no longer being used, reducing idle time (and costs)</li> </ul> <h1 id="what-well-be-doing" style="position:relative;">What we'll be doing<a href="#what-well-be-doing" aria-label="what well be doing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>This guide will explore how we can use CML to (re)train models from one of our Bitbucket pipelines. We will:</p> <ol> <li>Provision an EC2 instance on Amazon Web Services (AWS) from a Bitbucket pipeline</li> <li>Train a machine learning model on the provisioned instance</li> <li>Open a pull request that adds the resulting model to our Bitbucket repository</li> </ol> <p>While we could use Bitbucket's own runners for our model training, they have <a href="https://support.atlassian.com/bitbucket-cloud/docs/limitations-of-bitbucket-pipelines/#LimitationsofBitbucketPipelines-Buildlimits" target="_blank" rel="nofollow noopener noreferrer">limited</a> memory, storage, and processing power. Self-hosted runners let us work around these limitations: we can get a runner with specifications tailored to our computing needs. 
CML greatly simplifies the setup and orchestration of these runners.</p> <p>Moreover, if our data is hosted by our cloud provider, using a runner on the same cloud would be a logical approach to minimize data transfer costs and time.</p> <admon type="tip"> <p>While we'll be using <a href="https://cml.dev/doc/self-hosted-runners?tab=AWS#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">AWS</a> in this guide, CML works just as well with <a href="https://cml.dev/doc/self-hosted-runners?tab=GCP#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">Google Cloud Platform</a>, <a href="https://cml.dev/doc/self-hosted-runners?tab=Azure#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">Microsoft Azure</a>, and <a href="https://cml.dev/doc/self-hosted-runners#on-premise-local-runners" target="_blank" rel="nofollow noopener noreferrer">on-premise</a> machines. Of course, CML would need the appropriate credentials, but otherwise, it takes care of the differing configuration for us.</p> </admon> <h1 id="before-we-start" style="position:relative;">Before we start<a href="#before-we-start" aria-label="before we start permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>You can clone the repository for this guide <a href="https://bitbucket.org/iterative-ai/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <p>To help follow along, you may want to keep the <a href="https://cml.dev/doc/start/bitbucket" target="_blank" 
rel="nofollow noopener noreferrer">Getting started section of the CML docs</a> open in another tab. The docs cover the following prerequisite steps you'll need to take if you want to follow along with this blog post:</p> <ol> <li><a href="https://cml.dev/doc/self-hosted-runners?tab=Bitbucket#personal-access-token" target="_blank" rel="nofollow noopener noreferrer">Generate a <code>REPO_TOKEN</code> and set it as a repository variable</a>.</li> <li><a href="https://cml.dev/doc/ref/send-comment#bitbucket" target="_blank" rel="nofollow noopener noreferrer">Install the <em>Pull Request Commit Links app</em> in your Bitbucket workspace</a></li> </ol> <p>Additionally, you will need to take the following steps to allow Bitbucket to provision AWS EC2 instances on your behalf:</p> <ol> <li><a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-creds" target="_blank" rel="nofollow noopener noreferrer">Create an <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> on AWS</a></li> <li><a href="https://support.atlassian.com/bitbucket-cloud/docs/variables-and-secrets/" target="_blank" rel="nofollow noopener noreferrer">Add the <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> as repository variables</a></li> </ol> <admon type="warn"> <p>In this example, we will be provisioning an <code>m5.2xlarge</code> <a href="https://aws.amazon.com/ec2/instance-types/" target="_blank" rel="nofollow noopener noreferrer">AWS EC2 instance</a>. Note that this instance is not included in the free tier, and Amazon <a href="https://aws.amazon.com/ec2/pricing/on-demand/" target="_blank" rel="nofollow noopener noreferrer">will charge you for your usage</a> ($0.45 per hour at the time of writing). 
To minimize cost, CML always terminates the instance upon completion of the pipeline.</p> </admon> <h1 id="implementing-the-cml-bitbucket-pipeline" style="position:relative;">Implementing the CML Bitbucket pipeline<a href="#implementing-the-cml-bitbucket-pipeline" aria-label="implementing the cml bitbucket pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>The main point of interest in the project repository is the <code>bitbucket-pipelines.yml</code> file. Bitbucket will automatically recognize this file as the one containing our pipeline configuration. In our case, we have defined one pipeline (named <code>default</code>) that consists of two steps:</p> <h2 id="launch-self-hosted-runner" style="position:relative;">Launch self-hosted runner<a href="#launch-self-hosted-runner" aria-label="launch self hosted runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In the first step, we specify the runner we want to provision. We use a CML docker image and configure a runner on a medium (<code>m</code>) instance. 
CML <a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type" target="_blank" rel="nofollow noopener noreferrer">automatically translates this generic type to a cloud-specific one</a>. In the case of AWS, this corresponds with an <code>m5.2xlarge</code> instance.</p> <p>We also specify the <code>--idle-timeout=30min</code> and <code>--reuse-idle</code> options. The first of these specifies how long the provisioned instance should wait for jobs before it is terminated. This ensures that we are not racking up costs due to our instances running endlessly. With the latter, we ensure that a new instance is only provisioned when a runner is not already available with the same label. Combining these two options means that we can automatically scale up the number of runners (if there are multiple pull requests in parallel) and scale down when they are no longer required.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">step</span><span class="token punctuation">:</span>
    <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
    <span class="token key atrule">script</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token punctuation">|</span><span class="token scalar string">
        cml runner \
          --cloud=aws \
          --cloud-region=us-west \
          --cloud-type=m \
          --idle-timeout=30min \
          --reuse-idle \
          --labels=cml.runner</span></code></pre></div> <admon type="tip"> <p>CML <a href="https://cml.dev/doc/ref/runner" target="_blank" rel="nofollow noopener noreferrer">has many more options</a> that might pique your interest. 
For example, you could use <code>--single</code> to terminate instances right after completing one job. Or you could set a maximum bidding price for spot instances with <code>--cloud-spot-price=...</code>. With these features, CML helps you tailor instances precisely to your needs.</p> </admon> <h2 id="train-model-on-self-hosted-runner" style="position:relative;">Train model on self-hosted runner<a href="#train-model-on-self-hosted-runner" aria-label="train model on self hosted runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The second step in our pipeline defines the model training task. We specify that this step should run on the <code>[self.hosted, cml.runner]</code> we provisioned above. 
From here, our script defines the individual commands just as we would run them in our local terminal.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">step</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self.hosted<span class="token punctuation">,</span> cml.runner<span class="token punctuation">]</span>
    <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
    <span class="token comment"># GPU not yet supported, see https://github.com/iterative/cml/issues/1015</span>
    <span class="token key atrule">script</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> pip install <span class="token punctuation">-</span>r requirements.txt
      <span class="token punctuation">-</span> python get_data.py
      <span class="token punctuation">-</span> python train.py
      <span class="token comment"># Create pull request</span>
      <span class="token punctuation">-</span> cml pr model/random_forest.joblib
      <span class="token comment"># Create CML report</span>
      <span class="token punctuation">-</span> cat model/metrics.txt <span class="token punctuation">></span> report.md
      <span class="token punctuation">-</span> echo '' <span class="token punctuation">></span><span class="token punctuation">></span> report.md
      <span class="token punctuation">-</span> echo '<span class="token tag">!</span><span class="token punctuation">[</span>Confusion Matrix<span class="token punctuation">]</span>(model/confusion_matrix.png)' <span class="token punctuation">></span><span class="token punctuation">></span> report.md
      <span class="token punctuation">-</span> cml send<span class="token punctuation">-</span>comment <span class="token punctuation">-</span><span class="token punctuation">-</span>pr <span class="token punctuation">-</span><span class="token punctuation">-</span>update <span class="token punctuation">-</span><span class="token punctuation">-</span>publish report.md</code></pre></div> <p>First, we install our requirements, and then we run our data loading and model training scripts. At this point, our runner contains our newly trained model. However, we need to take a few extra steps to do something with that model. Otherwise, our results would be lost when CML terminates the instance.</p> <p>To add our model to our repository, we create a pull request with <code>cml pr</code>. We also create a CML report that displays the model performance in the pull request. We add the metrics and the confusion matrix created in <code>train.py</code> to the report, and <code>cml send-comment</code> updates the description of the pull request to the contents of <code>report.md</code> (i.e., our <code>metrics.txt</code> and confusion matrix).</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 482.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3db3ba81ce9aa53bd01025d6ef50cd79/39600/pr-screenshot.png" alt="The model training report in the pull request" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>The resulting pull request showing the model training report</em></p> <p>That's all there is to it! Once CML has created the pull request, we can merge it on Bitbucket. CML will automatically terminate the cloud instance after its specified idle time, thus saving us from high AWS expenses.</p> <admon type="tip"> <p>You might be interested in storing the resulting model in a DVC remote, rather than in your Git repository. 
<a href="https://iterative.ai/blog/CML-runners-saving-models-2" target="_blank" rel="nofollow noopener noreferrer">Follow this guide to learn how to do so</a>.</p> </admon> <h1 id="conclusions" style="position:relative;">Conclusions<a href="#conclusions" aria-label="conclusions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>CML allows us to incorporate our model training into our Bitbucket CI/CD. We can define a pipeline to provision a cloud instance that meets our requirements and then use the instance to train our model. The resulting model can be pushed to our Git repository, along with a detailed report on our model's performance.</p> <p>Because CML handles the interaction with our cloud provider of choice, we can switch between different providers (AWS, Azure, or Google Cloud Platform) by changing a single line. Moreover, CML automatically reduces our cloud expenses by terminating instances we are no longer using.</p> <p>Now that we have gotten started with CML in Bitbucket Pipelines, we can look toward some of CML's more advanced features. It might be worth exploring CML's spot recovery, for example, which can pick up training from the last epoch when a script is randomly terminated. Or we might be interested in training models on GPUs, which CML is also well-suited for.</p> <p>These topics warrant their own guides, however. Keep an eye out for these follow-ups on our blog, and make sure to let us know what you would like us to cover next! 
You can let us know in the comments or by <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">joining our Discord server</a>.</p>https://dvc.org/blog/august-22-community-gemshttps://dvc.org/blog/august-22-community-gemsTue, 30 Aug 2022 00:00:00 GMT<p>Hi there! This is Gema! Today I'll be your guide to Community Gems for August. Big shout out to <a href="https://twitter.com/flippedcoding" target="_blank" rel="nofollow noopener noreferrer">Milecia Mcgregor</a>, who co-authored this post.</p> <h2 id="if-i-am-tracking-a-directory-with-dvc-how-can-i-read-the-file-names-without-using-dvc-checkout" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1001787488173572147" target="_blank" rel="nofollow noopener noreferrer">If I am tracking a directory with DVC, how can I read the file names without using <code>dvc checkout</code>?</a><a href="#if-i-am-tracking-a-directory-with-dvc-how-can-i-read-the-file-names-without-using-dvc-checkout" aria-label="if i am tracking a directory with dvc how can i read the file names without using dvc checkout permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This is a wonderful question from @Mikita Karotchykau!</p> <p>You can read those file names with our DVC Python API. 
Here's an example of how that may work:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> os

<span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo

<span class="token keyword">for</span> item <span class="token keyword">in</span> Repo<span class="token punctuation">.</span>ls<span class="token punctuation">(</span>
    <span class="token string">"<repo_path_or_url>"</span><span class="token punctuation">,</span>
    <span class="token string">"/path/to/dir"</span><span class="token punctuation">,</span>
    dvc_only<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">,</span>
    rev<span class="token operator">=</span><span class="token string">"<rev>"</span><span class="token punctuation">,</span>
    recursive<span class="token operator">=</span><span class="token boolean">True</span>
<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span><span class="token string">"/path/to/dir"</span><span class="token punctuation">,</span> item<span class="token punctuation">[</span><span class="token string">"path"</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre></div> <h2 id="how-can-i-mock-the-execution-of-certain-stages-in-dvc-repro" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1004408394888777738" target="_blank" rel="nofollow noopener noreferrer">How can I mock the execution of certain stages in <code>dvc repro</code>?</a><a 
href="#how-can-i-mock-the-execution-of-certain-stages-in-dvc-repro" aria-label="how can i mock the execution of certain stages in dvc repro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>An interesting situation, posted as a question by @JesusCerquides!</p> <p>This situation might arise when you have stages that take a long time to run, or when you are confident in them and want to move on with the pipeline design, so you don't want to reproduce everything again. One example is when your feature engineering is good enough and you want to iterate over hyperparameters in training.</p> <p>You should be able to run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> in this case, as it provides a way to complete <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> when it has been used with the <code>--no-commit</code> or <code>--no-exec</code> options. 
Those options cause the command to skip certain stages so you can move to another stage without executing all of them.</p> <h2 id="how-can-i-change-the-dataset-for-a-dvc-pipeline-that-runs-completely-with-dvc-repro" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1004432985052942396" target="_blank" rel="nofollow noopener noreferrer">How can I change the dataset for a DVC pipeline that runs completely with <code>dvc repro</code>?</a><a href="#how-can-i-change-the-dataset-for-a-dvc-pipeline-that-runs-completely-with-dvc-repro" aria-label="how can i change the dataset for a dvc pipeline that runs completely with dvc repro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Great question from @5216!</p> <p>One of the straightforward solutions for this challenge is to replace the dataset in place and run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> again. If the dataset is at some other path, you can update <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> to use that new path instead of the original dataset path. 
If you want to keep the previous pipeline and its results for future reproducibility or other needs, you can use <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>, as it keeps a record in Git of all changes and allows you to create a branch if needed.</p> <h2 id="when-i-trigger-a-github-event-i-use-pull_request-types-labeled-and-it-seems-to-cause-the-runner-to-use-the-wrong-sha-how-can-i-fix-this" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/1001003933159915550" target="_blank" rel="nofollow noopener noreferrer">When I trigger a GitHub event, I use <code>pull_request: types: [labeled]</code> and it seems to cause the runner to use the wrong SHA. How can I fix this?</a><a href="#when-i-trigger-a-github-event-i-use-pull_request-types-labeled-and-it-seems-to-cause-the-runner-to-use-the-wrong-sha-how-can-i-fix-this" aria-label="when i trigger a github event i use pull_request types labeled and it seems to cause the runner to use the wrong sha how can i fix this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Thanks for the good question, @hyojoo!</p> <p>You might have found that this issue prevents you from sending comments to the PR. 
A <a href="https://github.com/iterative/cml/issues/880#issuecomment-1145522505" target="_blank" rel="nofollow noopener noreferrer">change</a> with respect to the SHAs made us point to the head reference.</p> <p>We've updated <a href="https://cml.dev/doc/start/github" target="_blank" rel="nofollow noopener noreferrer">CML Start</a> to include a fix:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v3
  <span class="token key atrule">with</span><span class="token punctuation">:</span>
    <span class="token key atrule">ref</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> github.event.pull_request.head.sha <span class="token punctuation">}</span><span class="token punctuation">}</span></code></pre></div> <h2 id="how-does-dvc-solve-the-file-versioning-problem-specifically-when-we-want-to-roll-back-to-previous-versions-of-the-dataset" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1005130028692017184" target="_blank" rel="nofollow noopener noreferrer">How does DVC solve the file versioning problem, specifically when we want to roll back to previous versions of the dataset?</a><a href="#how-does-dvc-solve-the-file-versioning-problem-specifically-when-we-want-to-roll-back-to-previous-versions-of-the-dataset" aria-label="how does dvc solve the file versioning problem specifically when we want to roll back to previous versions of the dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 
12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Time travel with DVC! We find this topic fascinating. Thanks for bringing this up, @MiaM!</p> <p>The <code>git checkout</code> command lets us restore any commit in the repository history. It automatically adjusts the repository files by replacing, adding, or deleting them. This Git command changes <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> and other DVC files, because Git tracks those DVC files but doesn't track the data files themselves. To actually get back to a previous version of the dataset, make sure to also run <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>.</p> <p>To illustrate, let's see what happens to the <code>data.xml.dvc</code> file and the cache folder when we go back to a previous dataset version. 
To do that, we will add a dataset, change it and add it to DVC again, and then go back to the first dataset version.</p> <p>First, we add a dataset and track it with DVC. If we explore the <code>data.xml.dvc</code> file and the cache folder, we see the MD5 hash for the file, a unique identifier!</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data.xml.dvc <span class="token comment"># will show file info including MD5 hash</span>
</span>outs:
- md5: a8d60da582524dac805fc7b64d762e58
  size: 33471
  path: data.xml

<span class="token line"><span class="token input">$ </span><span class="token command">cd</span> .dvc/cache
</span><span class="token line"><span class="token input">$ </span><span class="token command">tree</span> <span class="token comment"># will show dataset in the cache with hash reference</span>
</span>.
|___ a8
     |___ d60da582524dac805fc7b64d762e58
</code></pre></div> <p>After changing the dataset, we add it to DVC again. As you can see in the <code>data.xml.dvc</code> file, the MD5 hash has changed, as the dataset is different! The cache, however, keeps both versions. Smart!</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data.xml.dvc <span class="token comment"># will show new file info including MD5 hash</span>
</span>outs:
- md5: 8e4ed00d7118e31340db6c0ba572658e
  size: 35263
  path: data.xml

<span class="token line"><span class="token input">$ </span><span class="token command">cd</span> .dvc/cache
</span><span class="token line"><span class="token input">$ </span><span class="token command">tree</span> <span class="token comment"># will show both datasets in the cache with their hash reference</span>
</span>.
|___ 8e
|    |___ 4ed00d7118e31340db6c0ba572658e
|___ a8
     |___ d60da582524dac805fc7b64d762e58</code></pre></div> <p>Now let's get back to the previous version of the dataset:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> HEAD~1 data/data.xml.dvc
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> data/data.xml.dvc</span></code></pre></div> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data.xml.dvc
</span>outs:
- md5: a8d60da582524dac805fc7b64d762e58
  size: 33471
  path: data.xml</code></pre></div> <p>Interesting! The hash refers to the previous version of our dataset, which is still stored in our cache folder. Because the cache keeps the data, DVC lets you get back to previous files by running <code>git checkout</code> and <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> together. Please note that you have to check out with Git, but also with DVC! 
If you want <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> to always run after <code>git checkout</code>, you can use the <code>post-checkout</code> <a href="https://dvc.org/doc/command-reference/install#installed-git-hooks" target="_blank" rel="nofollow noopener noreferrer">Git hook</a> to automatically update the workspace with the correct data file versions.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 455px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/56e4e2a30bbced5dae70c20873eee9e8/39600/backtothefuture.png" alt="back to the future" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h2 id="how-can-i-plot-the-result-metrics-for-the-machine-learning-experiments-inside-vscode-dvc-extension-scenario" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/842220310585147452/991695952480043038" target="_blank" rel="nofollow noopener noreferrer">How can I plot the result metrics for the machine learning experiments inside VSCode DVC extension scenario?</a><a href="#how-can-i-plot-the-result-metrics-for-the-machine-learning-experiments-inside-vscode-dvc-extension-scenario" aria-label="how can i plot the result metrics for the machine learning experiments inside vscode dvc extension scenario permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Happy to discover that you are using the <a 
href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension</a> for VSCode, @Julian_!</p> <p>You can generate your plots data with <a href="https://dvc.org/doc/dvclive/dvclive-with-dvc" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>, depending on your machine learning problem, and save it as a CSV or JSON file or in another <a href="https://dvc.org/doc/user-guide/visualizing-plots#supported-file-formats" target="_blank" rel="nofollow noopener noreferrer">supported format</a>. You then need to list that file as a plots output in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, adding a <code>plots</code> section to the build stage:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">build</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> features.csv
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> model.pt
    <span class="token key atrule">metrics</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">metrics.json</span><span class="token punctuation">:</span>
          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
    <span class="token key atrule">plots</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">metrics.csv</span><span class="token punctuation">:</span> <span class="token comment"># specify the file name, including the .csv extension</span>
          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span></code></pre></div> <h2 id="im-constructing-a-pipeline-with-several-stages-inside-the-dvcyaml-file" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/1011617355849269258" target="_blank" rel="nofollow noopener noreferrer">I'm constructing a pipeline with several stages inside the <code>dvc.yaml</code> file. When I execute <code>dvc exp run</code> or <code>dvc repro</code> commands, stages run randomly. What is the reason behind this or did I miss something?</a><a href="#im-constructing-a-pipeline-with-several-stages-inside-the-dvcyaml-file" aria-label="im constructing a pipeline with several stages inside the dvcyaml file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Hello there, @ekmekci48! That is indeed a really great question.</p> <p>To ensure a linear order in your pipeline, you should chain all your pipeline stages so that each stage's output is the next stage's dependency, from the beginning to the end of your pipeline. Please make sure that you specify dependencies and outputs for each stage: that is what determines the execution order. Stages that don't depend on each other may still be executed in any order.</p> <p>As an example, imagine that we have 3 stages: load, feature engineering and training. 
The load output will be the feature engineering dependency, and the feature engineering output will be the training dependency.</p> <p>The key concept to keep in mind here is that the output of each stage should be declared as a dependency of the next one, across all pipeline stages.</p> <p>As an example, we have added a diagram from our learning <a href="https://learn.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">course</a>: check out the <code>-o</code> and <code>-d</code> flags. Those are key for chaining your stages.</p> <p>Let's also thank @daavoo for helping out by pointing to the docs on this one!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c84b6595555ff4070d3c1ac55e69caf5/39600/pipelines.png" alt="notes from pipelines lesson iterative learning course" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Please check out the <a href="https://dvc.org/doc/command-reference/dag" target="_blank" rel="nofollow noopener noreferrer">docs</a> to learn more!</p> <hr> <p><img src="https://media.giphy.com/media/l0IycQmt79g9XzOWQ/giphy.gif" alt="Shut It Down GIF by Matt Cutshall"></p> <p>Keep an eye out for our next Office Hours Meetup! Make sure you stay up to date with us to find out what it is! 
<a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/" target="_blank" rel="nofollow noopener noreferrer">Join our group</a> to stay up to date with specifics as we get closer to the event!</p> <p>Check out <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">our docs</a> to get all your DVC, CML, and MLEM questions answered!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to chat with the community!</p>https://dvc.org/blog/august-22-heartbeathttps://dvc.org/blog/august-22-heartbeatTue, 16 Aug 2022 00:00:00 GMT<p>Welcome to the August Heartbeat! As we all soak in the remaining summer days, swing along in your hammock and take in all the great news from the Iterative Community!</p> <p><img src="https://media.giphy.com/media/2uI9paIuAWgaqfyX0Q/giphy.gif" alt="Ukulele Hammock GIF by Northern Illinois University"></p> <h1 id="from-greater-aiml-community" style="position:relative;">From Greater AI/ML Community<a href="#from-greater-aiml-community" aria-label="from greater aiml community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="vanishing-gradients-podcast" style="position:relative;">Vanishing Gradients Podcast<a href="#vanishing-gradients-podcast" aria-label="vanishing gradients podcast permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 
5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1b59357e3392d514f5968b677fc40465/e2d37/vanishing-gradients.png" alt="Vanishing Gradients" title="Vanishing Gradients" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> If you are not familiar with <a href="https://twitter.com/hugobowne" target="_blank" rel="nofollow noopener noreferrer"><strong>Hugo Bowne-Anderson</strong></a>, you should be. He was the host of my all-time favorite Data Science podcast <a href="https://www.datacamp.com/podcast" target="_blank" rel="nofollow noopener noreferrer">DataFramed</a> while he was at <a href="https://www.datacamp.com/" target="_blank" rel="nofollow noopener noreferrer">DataCamp</a>. DataFramed helped me immeasurably when I started my data science journey. It provided not only great teachings on many data science concepts but, even more importantly, the ability to gain perspectives from different people across all parts of the data space, talking about challenges, danger zones, and issues that we all need to be aware of in the field. Recently Hugo started a new podcast, <a href="https://vanishinggradients.fireside.fm/" target="_blank" rel="nofollow noopener noreferrer">Vanishing Gradients</a>. This newer endeavor is in a somewhat different format than DataFramed, but still with Hugo's characteristic deep dive into all the challenges that come up when working with data. Hugo uses a long-format conversation approach with many leaders and great thinkers in the data science/machine learning/AI space. 
In episodes <a href="https://vanishinggradients.fireside.fm/7" target="_blank" rel="nofollow noopener noreferrer">seven</a> and <a href="https://vanishinggradients.fireside.fm/8" target="_blank" rel="nofollow noopener noreferrer">eight,</a> Hugo has a fascinating chat with <a href="https://twitter.com/pwang" target="_blank" rel="nofollow noopener noreferrer"><strong>Peter Wang</strong></a>, CEO of Anaconda, in which they talk about a number of topics including how Python became so big in Data Science, the emergence of open source collaborative environments, and things that the PyData stack solves. Then it gets really interesting as they dive into the open source model in the context of finite and infinite games and open source software as a "paradigm of humanity's ability to create generative, nourishing and anti-rivalrous systems." 🤯 Super interesting discussion and food for thought. I've already listened to both episodes twice. I highly recommend them and this new podcast in general.</p> <h1 id="from-the-iterative-tools-community" style="position:relative;">From the Iterative tools Community<a href="#from-the-iterative-tools-community" aria-label="from the iterative tools community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="mikołaj-kania---can-dvc-be-used-for-kaggle" style="position:relative;"><strong>Mikołaj Kania</strong> - Can DVC Be Used for Kaggle?<a href="#miko%C5%82aj-kania---can-dvc-be-used-for-kaggle" aria-label="mikołaj kania can dvc be used for kaggle permalink" class="anchor after"><svg aria-hidden="true" height="16" 
width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/MikolajKania" target="_blank" rel="nofollow noopener noreferrer"><strong>Mikołaj Kania</strong></a> suggests that you upgrade your Kaggle competition workflow from the “spaghetti code” of Jupyter Notebooks to the more mature way of creating reproducible ML results with DVC <a href="https://mikolajkania.com/2022/08/07/dvc-kaggle-mlops/" target="_blank" rel="nofollow noopener noreferrer">here on his blog</a>.</p> <p>He notes that notebooks make it really hard to compare changes between runs. Instead, he suggests a workflow where you create a branch for every major experiment type, experiment in each, and persist the best and most notable outcomes (good and bad). The best results are then submitted to Kaggle. 
You can find more about his workflow in <a href="https://github.com/mikolajkania/kaggle-03-house-prices" target="_blank" rel="nofollow noopener noreferrer">his repo for the project.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/58af1bbd7433b00332e27cde7b9dd4e1/39600/kaggle-dvc.png" alt="Using DVC for Kaggle Competition" title="Using DVC for Kaggle Competition" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC with Kaggle (<a href="https://mikolajkania.com/2022/08/07/dvc-kaggle-mlops/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>Mikołaj explains how DVC's project structure ensures reproducible results and develops habits on best practices. One drawback he noted was the lack of an experimentation UI, but we just introduced the DVC extension for VS Code to help with that, and there’s always Iterative Studio. Look out for improvement to the experiment features in both tools in the coming months! Also, experimenting with DVC in Kaggle may give you some good practice for things we are cooking up internally! 
😉🤫</p> <h2 id="shambhavi-mishra---searching-for-semantic-similarity" style="position:relative;"><strong>Shambhavi Mishra</strong> - Searching for Semantic Similarity<a href="#shambhavi-mishra---searching-for-semantic-similarity" aria-label="shambhavi mishra searching for semantic similarity permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/ShambhaviCodes" target="_blank" rel="nofollow noopener noreferrer"><strong>Shambhavi Mishra</strong></a> in her post <a href="https://medium.com/towards-artificial-intelligence/searching-for-semantic-similarity-cfbff2388d04" target="_blank" rel="nofollow noopener noreferrer">Searching for Semantic Similarity</a> details the steps of her NLP project on similarity algorithms. She mainly focuses on cosine similarity using a Stack Overflow questions dataset. 
The end-to-end project uses Sentence BERT, Fast Text, DVC, DAGsHub, Streamlit and deploys the web app on an AWS EC2 instance.</p> <p>Once you follow all the steps you will have computed the similarity between a search query and a database of texts and rank all the data by their similarity score to retrieve the most similar text to its index.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d3f7f7d61528e38fc14c23b13223680c/39600/cosine-similarity.png" alt="Cosine Similarity" title="Cosine Similarity" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Understanding Cosine Similarity (<a href="https://www.oreilly.com/library/view/mastering-machine-learning/9781785283451/ba8bef27-953e-42a4-8180-cea152af8118.xhtml" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="evgenii-munin---run-s3-locally-with-minio-for-the-dvc-machine-learning-pipeline" style="position:relative;"><strong>Evgenii Munin</strong> - Run S3 Locally With MinIO for the DVC Machine Learning Pipeline<a href="#evgenii-munin---run-s3-locally-with-minio-for-the-dvc-machine-learning-pipeline" aria-label="evgenii munin run s3 locally with minio for the dvc machine learning pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you are in need of object storage to work with data through an API, but need to do so in a private network, <a 
href="https://www.linkedin.com/in/evgenii-munin-01932a143/" target="_blank" rel="nofollow noopener noreferrer"><strong>Evgenii Munin</strong></a> shows how to set up MinIO as remote storage with DVC to do just that <a href="https://betterprogramming.pub/run-s3-locally-with-minio-for-dvc-machine-learning-pipeline-7fa3d240d3ab" target="_blank" rel="nofollow noopener noreferrer">in this piece on Medium</a>. In this cool use case, he starts by installing the MinIO server and building a Docker image to run it, and shares a great repo on Kafka-to-S3 where MinIO was used to mock S3 for the data. Then he shows you how to link the MinIO server as DVC remote storage.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3122dc4b5f05f5669546d3d8fe06f7d2/39600/minio.png" alt="Minio Browser" title="Minio Browser" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Minio Browser with Data pushed from DVC (<a href="https://betterprogramming.pub/run-s3-locally-with-minio-for-dvc-machine-learning-pipeline-7fa3d240d3ab" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="caleb-kaiser---moving-from-data-science-to-machine-learning-engineering" style="position:relative;"><strong>Caleb Kaiser</strong> - Moving from Data Science to Machine Learning Engineering<a href="#caleb-kaiser---moving-from-data-science-to-machine-learning-engineering" aria-label="caleb kaiser moving from data science to machine learning engineering permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 
1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>It can sometimes be confusing to determine where data science stops and machine learning engineering starts. <a href="https://twitter.com/KaiserFrose" target="_blank" rel="nofollow noopener noreferrer"><strong>Caleb Kaiser</strong></a> helps clarify this <a href="https://www.kdnuggets.com/2020/11/moving-data-science-machine-learning-engineering.html" target="_blank" rel="nofollow noopener noreferrer">in this old but good piece</a> in <a href="https://www.kdnuggets.com" target="_blank" rel="nofollow noopener noreferrer">KDnuggets</a>. He provides four examples of real-world projects and defines which portions of each project are data science and which are ML engineering. Overall, we find that machine learning engineering covers all the tasks needed to get the models that data scientists create into production applications.</p> <p>He then dives deeper into one of the examples and shows the promise of tools that bridge the gap between machine learning and software engineering, highlighting DVC and Hugging Face. 
This is a good piece to read if you are grappling with the difference!</p> <p><img src="https://media.giphy.com/media/xUNd9DLukkavmhybAs/giphy.gif" alt="Season 2 Episode 6 GIF by Portlandia"></p> <h2 id="just-a-few-other-things" style="position:relative;">Just a few other things…<a href="#just-a-few-other-things" aria-label="just a few other things permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li>GitHub Goodness alert for <a href="https://github.com/instill-ai/vdp" target="_blank" rel="nofollow noopener noreferrer">Visual Data Preparation (VDP),</a> an open-source visual data ETL tool to streamline the end-to-end visual data processing pipeline. 
Among the highlights: a fast way to build end-to-end visual data pipelines, pre-built ETL data connectors, and integration with DVC</li> <li><a href="https://twitter.com/jillianerowe" target="_blank" rel="nofollow noopener noreferrer"><strong>Jillian Rowe</strong></a> gave a shout-out to DVC on a <a href="https://topenddevs.com/podcasts/adventures-in-devops/episodes/the-intersection-of-data-and-devops-devops-124" target="_blank" rel="nofollow noopener noreferrer">recent podcast</a> from <a href="https://topenddevs.com/podcasts/adventures-in-devops" target="_blank" rel="nofollow noopener noreferrer">Adventures in DevOps Podcast</a> in an episode where they discuss the intersection of data and DevOps</li> <li>If you are interested in contributing to researchers' learning about machine learning experimentation tools, you can take <a href="https://www.freelancer.com.au/projects/machine-learning/Seeking-Qualified-Respondents-for-Online-34294453.html" target="_blank" rel="nofollow noopener noreferrer">this survey</a>. 
Spread the word!</li> </ul> <h2 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="-model-registry-released-in-iterative-studio" style="position:relative;">🎉 Model Registry released in Iterative Studio<a href="#-model-registry-released-in-iterative-studio" aria-label=" model registry released in iterative studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>On July 26th we released our new <a href="https://iterative.ai/model-registry" target="_blank" rel="nofollow noopener noreferrer">model registry in Iterative Studio.</a><br> The great work done by the MLEM team building a git-based model registry is now incorporated in Studio in a web UI. This release took the work of half the people in the company and we are proud of the steps we are taking to meet people where they are and round out your options whether you are comfortable in the CLI, API, or web UI. Be sure to try it out and give us your feedback. 
Learn more <a href="https://dvc.org/blog/iterative-studio-model-registry" target="_blank" rel="nofollow noopener noreferrer">in the blog post</a> and <a href="https://dvc.org/doc/studio/user-guide/model-registry/what-is-a-model-registry" target="_blank" rel="nofollow noopener noreferrer">in the docs</a>. Look out for a full tutorial coming soon!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DYeVI-QrHGI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="-iteratives-first-internal-hackathon" style="position:relative;">🧑🏽‍💻 Iterative's First Internal Hackathon<a href="#-iteratives-first-internal-hackathon" aria-label=" iteratives first internal hackathon permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Last week we had our very first internal Hackathon! The entire company participated in the 48-hour computer vision challenge classifying dogs, cats, croissants and muffins. 
Part of the objective was to familiarize ourselves with, and test, a new tool that we expect to release later this year.</p> <p>Eight teams competed for prizes for the best outcome, but also for the best integrations with other tools, the best dog, cat, croissant, and muffin photos from team members, and the best notes from the experience. I think the notes of our newest DevRel <a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> are in the running for the prize. (Learn more about Gema in the New Hires section below!)</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/798cc42eba6e04b60100fb9f4f5d0d4f/03346/gema-hackathon-notes.jpg" alt="Gema Parreño Piqueras' Hackathon notes" title="Gema Parreño Piqueras' Hackathon notes" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Gema Parreño Piqueras' Hackathon notes (<a href="https://twitter.com/SoyGema/status/1558135976698028034?s=20&t=lXyAWLISwf8gUl8SZS84AQ" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>See the members of the winning teams below. Team members <a href="https://www.linkedin.com/in/danielkharitonov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Kharitonov</strong></a> and <a href="https://www.linkedin.com/in/jon-burdo-59730a83/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jon Burdo</strong></a> organized the whole event and put together an extremely comprehensive document to help guide the teams. 
We are looking forward to more of these events in the future!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7f9d897ef06e99ea1e5f9cddc70b8413/03346/winners.jpg" alt="Winners of the First Iterative Hackathon" title="Winners of the First Iterative Hackathon" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Winners of the first Iterative Internal Hackathon, Source: Dmitry Petrov</em></p> <h3 id="-dmitry-petrov-in-ai-techpark-and-the-new-stack" style="position:relative;">📰 Dmitry Petrov in AI Techpark and The New Stack<a href="#-dmitry-petrov-in-ai-techpark-and-the-new-stack" aria-label=" dmitry petrov in ai techpark and the new stack permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> gives a sneak peek into the recent developments at Iterative.ai, highlights the most exciting trends, and shares about his entrepreneurial journey <a href="https://ai-techpark.com/aitech-interview-with-dmitry-petrov-co-founder-ceo-at-iterative-ai/" target="_blank" rel="nofollow noopener noreferrer">in this article</a> in <a href="https://ai-techpark.com/ai/" target="_blank" rel="nofollow noopener noreferrer">AI Techpark.</a></p> <p>Dmitry also wrote a piece for <a href="https://thenewstack.io/" target="_blank" rel="nofollow noopener noreferrer">The 
New Stack</a> entitled <a href="https://thenewstack.io/why-we-built-an-open-source-ml-model-registry-with-git/" target="_blank" rel="nofollow noopener noreferrer">Why We Built an Open Source ML Model Registry with Git</a>. As the title suggests, it explains the why, along with learnings from our customers' use cases and the realization that there is a need for Model Registry as Code (MRaC), continuing our GitOps approach to building tools for machine learning.</p> <h2 id="david-de-la-iglesia-castro---making-mlops-uncool-again" style="position:relative;"><strong>David de la Iglesia Castro</strong> - Making MLOps Uncool Again<a href="#david-de-la-iglesia-castro---making-mlops-uncool-again" aria-label="david de la iglesia castro making mlops uncool again permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you haven't gotten a chance to make it to the conferences where <a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia Castro</strong></a> presented his popular talk or workshop entitled <a href="https://www.youtube.com/watch?v=J6fduKE1j1g" target="_blank" rel="nofollow noopener noreferrer">Making MLOps Uncool Again</a>, you can now catch it on our very own <a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube channel</a>! In this presentation you will learn how to build an MLOps workflow by extending the power of Git and GitHub with open-source tools DVC and CML. 
In the end, you will have an automated workflow that covers the entire lifecycle of an ML model, from data labeling to monitoring predictions. <a href="https://github.com/iterative/workshop-uncool-mlops" target="_blank" rel="nofollow noopener noreferrer">Find the repo for the project here.</a> And the <a href="https://github.com/iterative/workshop-uncool-mlops-solution" target="_blank" rel="nofollow noopener noreferrer">solution here</a>.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/J6fduKE1j1g?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="new-hires" style="position:relative;">New hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> joins our team from Madrid, Spain as a Developer Advocate. 
You may already be familiar with Gema if you've been taking our <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">online course</a> this summer because of the <a href="https://twitter.com/SoyGema/status/1558135976698028034?s=20&t=pJAfd-S4aoKGf4UhsnlgCw" target="_blank" rel="nofollow noopener noreferrer">gorgeous notes</a> she contributed per module. Gema was born and raised as an Architect (of buildings) but switched to tech a while back. She had her own video game start-up and has also worked as a Data Scientist in the Financial Industry. She has contributed to an open-source StarCraft II ML project. Gema loves indie games, puzzles, and croquettes! She makes the 4th teammate from España! 🇪🇸</p> <p><a href="https://www.linkedin.com/in/marcinjasion/" target="_blank" rel="nofollow noopener noreferrer"><strong>Marcin Jasion</strong></a> joins the team as a Senior Platform Engineer from Poland. He has been friends with team member Paweł Redzyński for years. When not working, he likes travelling, eating, and motorcycling, and is an avid cross-fitter. He also has a cat that likes to be a part of meetings! 🐈</p> <p><a href="https://www.linkedin.com/in/domasmonkus/" target="_blank" rel="nofollow noopener noreferrer"><strong>Domas Monkus</strong></a> joins the CML team as an engineer from Lithuania. Before joining us at Iterative, Domas spent 10 years at Canonical working on juju, livepatch, and many internal projects. He's a husband and father with a house outside the hustle and bustle of the city, so he mentioned that lawn mowing is one of his main free-time activities. 
🏡</p> <h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This week is <a href="https://ai4.io/" target="_blank" rel="nofollow noopener noreferrer">AI4</a>! <a href="https://twitter.com/fullstackml?lang=en" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> will give a talk as well as participate in a panel discussion on MLOps. If you are attending, stop by the booth and say hi or check out one of the in-booth demos we will have on our tools throughout the day.</p> <p>Additional conferences we will be attending this year:</p> <ul> <li><a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer"><strong>Gema Parreño Piqueras</strong></a> and our lead docs writer, <a href="https://twitter.com/JorgeOrpinel" target="_blank" rel="nofollow noopener noreferrer"><strong>Jorge Orpinel Perez</strong></a> will be heading to Mexico City August 31-September 1st for the <a href="https://www.latam-ai.com/" target="_blank" rel="nofollow noopener noreferrer">LATAM AI Conference</a>. 
Gema will give a presentation on experimentation in our new <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension for VS Code</a>.</li> <li><a href="https://www.southerndatascience.com/" target="_blank" rel="nofollow noopener noreferrer">Southern Data Science Conference</a> in Atlanta, GA on September 8-9th.</li> <li><a href="https://odsc.com/california/" target="_blank" rel="nofollow noopener noreferrer">ODSC West</a> in San Francisco</li> <li><a href="https://deeplearningworld.de/" target="_blank" rel="nofollow noopener noreferrer">Deep Learning World</a> - Berlin</li> <li><a href="https://www.re-work.co/events/mlops-summit-2022" target="_blank" rel="nofollow noopener noreferrer">MLOps Summit - Re-work</a> - London</li> <li>Dmitry Petrov will be speaking at <a href="https://www.githubuniverse.com/" target="_blank" rel="nofollow noopener noreferrer">GitHub Universe</a> on November 9-10!</li> <li><a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a>- Toronto</li> </ul> <p>We also will be reviving our virtual meetups this fall so be sure to <a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/" target="_blank" rel="nofollow noopener noreferrer">join our group on Meetup.</a></p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a 
href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the open positions. Please share with anyone looking to have a lot of fun building the next generation of machine learning to production tools! 🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b06f20b39d5f8146f4baadac1aa90e0b/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative is Hiring (<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="-doc-updates" style="position:relative;">✍🏼 Doc Updates<a href="#-doc-updates" aria-label=" doc updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>As noted above there are <a href="https://dvc.org/doc/studio/user-guide/model-registry/what-is-a-model-registry" target="_blank" rel="nofollow noopener noreferrer">new docs for Iterative Studio's Model Registry</a></li> <li>In case you missed it, CML now supports <a href="https://bitbucket.org/product" target="_blank" rel="nofollow noopener noreferrer">Bitbucket</a>! 
You can find the <a href="https://cml.dev/doc/start/bitbucket#get-started-with-cml-on-bitbucket" target="_blank" rel="nofollow noopener noreferrer">docs for the Bitbucket integration here</a>.</li> </ul> <h3 id="-blog-post" style="position:relative;">✍🏼 Blog post<a href="#-blog-post" aria-label=" blog post permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>💎 Don't miss <a href="https://dvc.org/blog/july-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">July's Community Gems</a>, full of great questions from the Community.</li> <li><a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> provides a new tutorial for <a href="https://dvc.org/blog/serving-models-with-mlem" target="_blank" rel="nofollow noopener noreferrer">Serving Machine Learning Models with MLEM.</a> Don't miss it!</li> </ul> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Once again we have a tie for best Tweet! 
Looking forward to seeing the video on this one from <a href="https://twitter.com/AvikalpGupta" target="_blank" rel="nofollow noopener noreferrer"><strong>Avikalp Kumar Gupta</strong></a>!🍿 You can find the slides <a href="https://drive.google.com/file/d/1-iOgtVDWG13A9MxRDet246Gnbdrkb0vv/view" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/microwin?src=hash&ref_src=twsrc%5Etfw">#microwin</a> of the day:<br><br>Spoke at <a href="https://twitter.com/hashtag/GCCDBLR?src=hash&ref_src=twsrc%5Etfw">#GCCDBLR</a> '22 (annual flagship event by <a href="https://twitter.com/gdgcblr">@gdgcblr</a>) about setting up effective <a href="https://twitter.com/hashtag/DataScience?src=hash&ref_src=twsrc%5Etfw">#DataScience</a> teams. And shared with everyone, how tools like <a href="https://twitter.com/hashtag/git?src=hash&ref_src=twsrc%5Etfw">#git</a> <a href="https://twitter.com/github">@github</a> <a href="https://twitter.com/DVCorg">@DVCorg</a> <a href="https://twitter.com/ProjectJupyter">@ProjectJupyter</a> Jupytext and <a href="https://twitter.com/vibinex">@vibinex</a> can make it easier.<a href="https://twitter.com/hashtag/technology?src=hash&ref_src=twsrc%5Etfw">#technology</a> <a href="https://twitter.com/hashtag/startup?src=hash&ref_src=twsrc%5Etfw">#startup</a> <a href="https://twitter.com/hashtag/day38?src=hash&ref_src=twsrc%5Etfw">#day38</a> <a href="https://t.co/GBLXa9OGAO">pic.twitter.com/GBLXa9OGAO</a></p>— Avikalp Kumar Gupta (@AvikalpGupta) <a href="https://twitter.com/AvikalpGupta/status/1556609442908884994">August 8, 2022</a></blockquote> <p>It's also great to have our new DVC extension shouted out by <a href="https://twitter.com/HaroldSinnott" target="_blank" rel="nofollow noopener noreferrer"><strong>Harold Sinnott</strong></a>!</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">10 VScode extensions every data 
scientist should have💻🤖<br><br>1. Python<br>2. Pylance<br>3. Python Indent<br>4. Jupyter<br>5. Jupyter notebook renderers<br>6. DVC - (ML model experiment tracking)<br>7. Gitlens<br>8. Todo MD<br>9. Excel viewer<br>10. Markdown preview GitHub styling<br><br>via <a href="https://twitter.com/avikumart_">@avikumart_</a> <a href="https://twitter.com/hashtag/AI?src=hash&ref_src=twsrc%5Etfw">#AI</a> <a href="https://twitter.com/hashtag/IoT?src=hash&ref_src=twsrc%5Etfw">#IoT</a></p>— Harold Sinnott 🇺🇸 (@HaroldSinnott) <a href="https://twitter.com/HaroldSinnott/status/1545058509087092736">July 7, 2022</a></blockquote> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/july-22-community-gemshttps://dvc.org/blog/july-22-community-gemsTue, 26 Jul 2022 00:00:00 GMT<h2 id="how-can-i-track-a-new-file-added-to-my-data-folder-if-the-data-folder-is-already-tracked-by-dvc-yet-ignored-by-git" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/983278896587894804" target="_blank" rel="nofollow noopener noreferrer">How can I track a new file added to my <code>data</code> folder if the <code>data</code> folder is already tracked by DVC, yet ignored by Git?</a><a href="#how-can-i-track-a-new-file-added-to-my-data-folder-if-the-data-folder-is-already-tracked-by-dvc-yet-ignored-by-git" aria-label="how can i track a new file added to my data folder if the data folder is already tracked by dvc yet ignored by git permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Great question on how DVC handles data tracking from @NgHoangDat!</p> <p>Since you already track the <code>data</code> folder, when you add a new file into it, all you need to do is update your DVC history. 
You can use either <a href="https://dvc.org/doc/command-reference/add"><code>dvc add data</code></a> or <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> to start tracking the new file.</p> <p>DVC also recalculates hashes only for the files that changed. If you add or modify a small number of files in that folder, the update will not take very long.</p> <h2 id="what-would-be-the-best-method-to-get-the-remote-url-of-a-given-dataset-inside-a-python-environment" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/984870485668008007" target="_blank" rel="nofollow noopener noreferrer">What would be the best method to get the remote URL of a given dataset inside a Python environment?</a><a href="#what-would-be-the-best-method-to-get-the-remote-url-of-a-given-dataset-inside-a-python-environment" aria-label="what would be the best method to get the remote url of a given dataset inside a python environment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Wonderful question from @come_arvis!</p> <p>You can use the <code>get_url</code> method of the <a href="https://dvc.org/doc/api-reference" target="_blank" rel="nofollow noopener noreferrer">DVC Python API</a> to do this. 
Here's an example of a script you might run to get the remote URL.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api resource_url <span class="token operator">=</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>get_url<span class="token punctuation">(</span> <span class="token string">'get-started/data.xml'</span><span class="token punctuation">,</span> repo<span class="token operator">=</span><span class="token string">'https://github.com/iterative/dataset-registry'</span> <span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>resource_url<span class="token punctuation">)</span> <span class="token comment"># https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355</span></code></pre></div> <p>This URL is built with the remote URL from the project configuration file, <code>.dvc/config</code>, and the <code>md5</code> file hashes stored in the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file corresponding to the data file or directory you want the storage location of.</p> <h2 id="im-excited-about-mlem-helping-expose-api-endpoints-to-our-model-but-heard-it-was-experimental-where-can-i-learn-more-about-how-to-deploy-models-with-this-tool" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/992517466662117386" target="_blank" rel="nofollow noopener noreferrer">I'm excited about MLEM helping expose API endpoints to our model, but heard it was experimental. 
Where can I learn more about how to deploy models with this tool?</a><a href="#im-excited-about-mlem-helping-expose-api-endpoints-to-our-model-but-heard-it-was-experimental-where-can-i-learn-more-about-how-to-deploy-models-with-this-tool" aria-label="im excited about mlem helping expose api endpoints to our model but heard it was experimental where can i learn more about how to deploy models with this tool permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Great question from @raveman^2!</p> <p>There are a few ways you can expose API endpoints for your model:</p> <ul> <li>Run <code>mlem serve</code> to generate a FastAPI endpoint with your model.</li> <li>Export the model as a Python package for your own custom-built API.</li> <li>Use the experimental deployment to Heroku.</li> </ul> <p>You can find more details in the MLEM docs: <a href="https://mlem.ai/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">https://mlem.ai/doc/get-started</a></p> <p>You can also see an example of deploying a model with MLEM in this <a href="https://dvc.org/blog/serving-models-with-mlem" target="_blank" rel="nofollow noopener noreferrer">blog post tutorial</a>.</p> <h2 id="how-do-i-revert-a-dvc-add-command-to-stop-tracking-data" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/993111134896918599" target="_blank" rel="nofollow noopener noreferrer">How do I revert a <code>dvc add</code> command to stop tracking data?</a><a href="#how-do-i-revert-a-dvc-add-command-to-stop-tracking-data" aria-label="how do i 
revert a dvc add command to stop tracking data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This is a good question from @Nwoke!</p> <p>If you have accidentally added the wrong directory or files for DVC to track, you can easily remove them with the <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a> command. It removes the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file and ensures that the original data file is no longer being tracked. Here's an example of this command being used:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remove</span> data.csv.dvc</span></code></pre></div> <p>Sometimes when you stop tracking data, you also want to remove it from your cache. You can do this with the <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> command, which removes all unused data from the cache, not just the target of <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a>. 
If you want to remove all of the data and its previous versions from the cache, you can do that with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc gc</span> <span class="token parameter variable">-w</span></span></code></pre></div> <p>The <code>-w</code> option keeps only the files and directories referenced in the workspace, so once you have removed the data you don't want to track, this is how DVC knows what to keep and what to discard.</p> <p>You can learn more about removing tracked data in <a href="https://dvc.org/doc/user-guide/how-to/stop-tracking-data" target="_blank" rel="nofollow noopener noreferrer">the docs here</a>.</p> <h2 id="is-it-normal-for-the-outs-of-a-stage-to-be-removed-when-dvc-repro-is-run" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/993781745524691087" target="_blank" rel="nofollow noopener noreferrer">Is it normal for the <code>outs</code> of a stage to be removed when <code>dvc repro</code> is run?</a><a href="#is-it-normal-for-the-outs-of-a-stage-to-be-removed-when-dvc-repro-is-run" aria-label="is it normal for the outs of a stage to be removed when dvc repro is run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Fantastic question from @Nish!</p> <p>This is the expected behavior of DVC. It removes the <code>outs</code> of a stage unless the <code>persist: true</code> value is set for that output. 
You can learn more about how this works in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#output-subfields" target="_blank" rel="nofollow noopener noreferrer">our docs here</a>. Here's an example of a stage with the <code>persist</code> value set.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> date <span class="token punctuation">></span> data/external/date <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">data/external</span><span class="token punctuation">:</span> <span class="token key atrule">persist</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div> <p>Even if you don't persist your <code>outs</code>, you can still check out an older version of the pipeline to get older <code>outs</code> with <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>. This is based on what's in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> and <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files and it will update your workspace to match the experiment you check out. This is usually run after checking out a different Git branch. 
So the flow might look like:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> experiment-branch </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span></span></code></pre></div> <p>The first command restores the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> and <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files for the experiment you want to go back to from your Git history. The second uses DVC to bring your data to the matching version, so you can reproduce your entire experiment. You can learn more about these details in <a href="https://dvc.org/doc/command-reference/checkout" target="_blank" rel="nofollow noopener noreferrer">the <code>dvc checkout</code> docs here</a>.</p> <h2 id="is-there-a-way-to-have-a-plot-with-multiple-y-axes" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/994685566698410055" target="_blank" rel="nofollow noopener noreferrer">Is there a way to have a plot with multiple y-axes?</a><a href="#is-there-a-way-to-have-a-plot-with-multiple-y-axes" aria-label="is there a way to have a plot with multiple y axes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Wonderful question from @shortcipher3!</p> <p>If you update DVC to version <code>2.12.1</code> or higher, 
you should be able to define multiple y-axes in your DVC pipeline. Here's an example of how this may look in a <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># dvc.yaml</span> <span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token punctuation">...</span> <span class="token key atrule">plots</span><span class="token punctuation">:</span> <span class="token key atrule">some_file.csv</span><span class="token punctuation">:</span> <span class="token key atrule">x</span><span class="token punctuation">:</span> x_column_name <span class="token key atrule">y</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>col1<span class="token punctuation">,</span> col2<span class="token punctuation">,</span> col3<span class="token punctuation">]</span> <span class="token comment"># alternative 1:</span> <span class="token key atrule">multiple_rocs</span><span class="token punctuation">:</span> <span class="token key atrule">x</span><span class="token punctuation">:</span> x_column_name <span class="token key atrule">y</span><span class="token punctuation">:</span> <span class="token key atrule">some_file.csv</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>col1<span class="token punctuation">,</span> col2<span class="token punctuation">,</span> col3<span class="token punctuation">]</span> <span class="token comment"># in case of multiple files:</span> <span class="token key atrule">multiple_rocs_from_multiple_files</span><span class="token punctuation">:</span> <span class="token key atrule">x</span><span class="token punctuation">:</span> x_column_name <span class="token key atrule">y</span><span class="token punctuation">:</span> <span class="token key 
atrule">file1.csv</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>col1<span class="token punctuation">,</span> col2<span class="token punctuation">]</span> <span class="token key atrule">file2.csv</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>col3<span class="token punctuation">]</span></code></pre></div> <p>A quick note: make sure that <code>plots</code> is on the same level as <code>stages</code> in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.</p> <h2 id="how-do-you-structure-the-dvcyaml-file-to-run-in-stages-in-a-specific-order" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/991000853278232616" target="_blank" rel="nofollow noopener noreferrer">How do you structure the <code>dvc.yaml</code> file to run in stages in a specific order?</a><a href="#how-do-you-structure-the-dvcyaml-file-to-run-in-stages-in-a-specific-order" aria-label="how do you structure the dvcyaml file to run in stages in a specific order permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Awesome question from @srb302!</p> <p>You would need to set up outputs and dependencies for each stage. So a stage that is run first would generate an output, and the stage that is supposed to run second would use the first stage's output as a dependency.</p> <p>Otherwise, DVC does not guarantee any particular execution order for stages which are independent of each other. 
DVC determines the structure of your DAG based on file outputs and dependencies and there isn't another way to enforce order of stage execution in DVC.</p> <h2 id="how-do-i-know-when-i-should-track-a-file-with-git-or-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/993120910095699978" target="_blank" rel="nofollow noopener noreferrer">How do I know when I should track a file with Git or DVC?</a><a href="#how-do-i-know-when-i-should-track-a-file-with-git-or-dvc" aria-label="how do i know when i should track a file with git or dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This is a really good question from @vadim.sukhov!</p> <p>Let's take a look at an example <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">evaluate</span><span class="token punctuation">:</span> <span class="token punctuation">...</span> <span class="token key atrule">plots</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">prc.json</span><span class="token punctuation">:</span> <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span> <span class="token key atrule">x</span><span class="token 
punctuation">:</span> recall <span class="token key atrule">y</span><span class="token punctuation">:</span> precision <span class="token punctuation">-</span> <span class="token key atrule">roc.json</span><span class="token punctuation">:</span> <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span> <span class="token key atrule">x</span><span class="token punctuation">:</span> fpr <span class="token key atrule">y</span><span class="token punctuation">:</span> tpr</code></pre></div> <p>In this scenario, the <code>prc.json</code> and <code>roc.json</code> files are <strong>not</strong> being cached by DVC because of the <code>cache: false</code> value. Since these files aren't in the DVC cache, they aren't saved to a remote storage location outside of Git, like data files are. So if you have <code>cache: false</code> on a file that you want to keep track of, you'll need to commit it to Git in your project.</p> <hr> <p><img src="https://media.giphy.com/media/pdSncNyYgaH0wqaCqp/giphy.gif" alt="Duck Dynasty GIF by DefyTV"></p> <p>Keep an eye out for our next Office Hours Meetup! Make sure you stay up to date with us to find out what it is! <a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/" target="_blank" rel="nofollow noopener noreferrer">Join our group</a> to stay up to date with specifics as we get closer to the event!</p> <p>Check out <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">our docs</a> to get all your DVC, CML, and MLEM questions answered!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to chat with the community!</p>https://dvc.org/blog/iterative-studio-model-registryhttps://dvc.org/blog/iterative-studio-model-registryTue, 26 Jul 2022 00:00:00 GMT<p>Machine learning tasks are iterative by nature. 
Over time, you build several versions of your ML models, which could be in different stages of production-readiness. A version may be running in production, another version that seems to perform better may be in staging, and a couple more versions could be in active development by you and your teammates - using updated hyperparameters, datasets, or algorithms.</p> <p>How do you keep track of all your models, their versions, and deployment statuses? How do you get answers to questions like these easily:</p> <ul> <li>Which model version is currently in production?</li> <li>When was the last time this model was updated?</li> </ul> <p>If you are like some of the data scientists we know, you may have a Google sheet or a Notion page with the list of all your models, their changes, deployment history, and so on. But this is highly error-prone and will probably get out-of-date very quickly. Or maybe you upload all your models to a cloud bucket and “attach” text reports to them. Not very maintainable or searchable either. We’ve even seen people use sticky notes, or better yet, rely on their memory 😀.</p> <p>Some of the more organized folks use Model Registries - tools created specifically to organize models into a central, searchable repository. While this is definitely better than using random files or sticky notes, one major problem persists: the data science and machine learning team members work completely isolated from the software development and DevOps team members. 
This makes collaboration far more time-consuming than it should be.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/408b5d1f21c83dd2dba8cc35b40238b6/39600/disconnected-silos.png" alt="Teams can work in disconnected silos" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Some even implement in-house systems, and maybe you are also planning to do so. But these can get expensive to develop and maintain.</p> <p><strong>We built the Iterative Studio Model Registry to solve these problems.</strong></p> <p>Iterative Studio Model Registry enables ML teams to collaborate on models by providing model organization, discovery, versioning, lineage (tracing the origin of the model), and the ability to manage deployment statuses such as development, staging, and production across multiple projects.</p> <h2 id="utilize-your-existing-git-infrastructure" style="position:relative;">Utilize your existing Git infrastructure<a href="#utilize-your-existing-git-infrastructure" aria-label="utilize your existing git infrastructure permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Iterative Studio Model Registry is built on top of Git, which means:</p> <ul> <li>You can reuse your existing Git infrastructure to manage ML models together with code, data, experiment pipelines, and deployment statuses.</li> <li>You can use GitOps for model deployment, and trigger model 
deployment from Iterative Studio, which you can also use to run your ML experiments.</li> <li>DS/ML folks and Software/DevOps folks can work together more easily, because they utilize the same tools and infrastructure.</li> </ul> <h2 id="open-mlops" style="position:relative;">Open MLOps<a href="#open-mlops" aria-label="open mlops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A core philosophy at Iterative is open MLOps - we build tools that work with your infrastructure. Our toolstack is modular, so you can build your model registry on top of your existing cloud and DevOps infrastructure.</p> <p>Under the hood, Iterative Studio Model Registry uses Iterative’s open-source Git-based tools <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> and <a href="https://mlem.ai/" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>.</p> <ul> <li><a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> enables <a href="https://semver.org/" target="_blank" rel="nofollow noopener noreferrer">semantic versioning</a> and stage transitions of artifacts using metadata files and Git tags.</li> <li><a href="https://mlem.ai/" target="_blank" rel="nofollow noopener noreferrer">MLEM</a> saves ML models and extracts model metadata including framework, methods, input / output data schema, and requirements.</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img 
class="gatsby-resp-image-image" src="https://dvc.org/static/babf62e08455b02dae8de67684bc7a65/39600/modular-toolstack.png" alt="Iterative toolstack is modular" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h2 id="ui-of-your-choice" style="position:relative;">UI of your choice<a href="#ui-of-your-choice" aria-label="ui of your choice permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Iterative Studio Model Registry meets you where you are, through your favorite interface. Whether you like APIs, prefer a web interface, or work best in the command line, we've got you covered whatever your role or preference, so your team can be most efficient.</p> <h2 id="models-can-reside-anywhere" style="position:relative;">Models can reside anywhere<a href="#models-can-reside-anywhere" aria-label="models can reside anywhere permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Save your model files wherever works best for you, whether it’s in S3, GCP, or any of your other remote (or local) storage locations. 
Then, add them to the model registry in a non-intrusive, no-code fashion <strong>without modifying your ML training code</strong>. This saves you hours of valuable time.</p> <h2 id="collaborate-across-multiple-projects" style="position:relative;">Collaborate across multiple projects<a href="#collaborate-across-multiple-projects" aria-label="collaborate across multiple projects permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A central dashboard of all your models facilitates transparency and discovery across every project by your whole team.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d998d5277de4d7bf4f64506384f7c134/39600/models-dashboard.png" alt="Models are organized in a central dashboard" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>And on the model details page, you’ll find that information about the model is automatically detected and its history tracked.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/621a6f181d9bee2ab9071b9be4f845df/39600/model-details-page.png" alt="All models have separate model detail pages" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <admon type="tip"> <p>Try our <a 
href="https://studio.datachain.ai/team/Iterative/models" target="_blank" rel="nofollow noopener noreferrer">demo Model Registry</a> to get a feel for Iterative Studio's Model Registry features.</p> </admon> <h2 id="create-model-versions-and-stages-from-any-git-commit" style="position:relative;">Create model versions and stages from any Git commit<a href="#create-model-versions-and-stages-from-any-git-commit" aria-label="create model versions and stages from any git commit permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>For registering versions, select the commit and provide the version number. To assign stages, select the version and provide the stage name. 
It is as simple as that.</p> <h2 id="git-remains-the-single-source-of-truth-for-all-your-ml-projects" style="position:relative;">Git remains the single source of truth for all your ML projects<a href="#git-remains-the-single-source-of-truth-for-all-your-ml-projects" aria-label="git remains the single source of truth for all your ml projects permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here’s a brief explanation of how the model, version, and stage information is stored in Git:</p> <ul> <li>The following entry in <code>artifacts.yaml</code> indicates that your <code>image-classifier-model</code> model is stored in an <code>S3</code> bucket.</li> </ul> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">image-classifier-model</span><span class="token punctuation">:</span> <span class="token key atrule">description</span><span class="token punctuation">:</span> This model is used to classify images of different objects submitted by users. This version of the model has an accuracy of 95%. 
<span class="token key atrule">labels</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> Random Forest <span class="token punctuation">-</span> image classification <span class="token punctuation">-</span> sklearn <span class="token key atrule">path</span><span class="token punctuation">:</span> .mlem/model/image<span class="token punctuation">-</span>classifier<span class="token punctuation">-</span>model <span class="token key atrule">type</span><span class="token punctuation">:</span> model</code></pre></div> <p>In the following example, the Git tag <code>image-classifier-model@v2.0.0</code> indicates that you created version <code>2.0.0</code> of your <code>image-classifier-model</code> from the Git commit <code>6c0fc85</code>.</p> <p>The Git tag <code>image-classifier-model#production#3</code> indicates that you assigned the <code>production</code> stage to version <code>2.0.0</code> of your model.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 394px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b8b68153a471f89bb3276775b475e26d/39600/git-tags.png" alt="Git tags represent model version and stage" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h2 id="a-single-platform-for-all-your-mlops-needs" style="position:relative;">A single platform for all your MLOps needs<a href="#a-single-platform-for-all-your-mlops-needs" aria-label="a single platform for all your mlops needs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 
3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Since its inception, Iterative Studio has brought together <a href="https://git-scm.com/" target="_blank" rel="nofollow noopener noreferrer">Git</a>, <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, and <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> for seamless data and model management, experiment tracking, visualization and automation. Now, by harnessing the power of <a href="https://mlem.ai/" target="_blank" rel="nofollow noopener noreferrer">MLEM</a> and <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> in its Model Registry, it makes your machine learning processes even more robust.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>With the Iterative Studio Model Registry, your ML model (dis)organization is not in chaos anymore. Collaborating on your ML projects becomes faster and your ML team members’ lives become much easier.</p> <p>Start using <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio Model Registry</a> today. 
And answer all the who, what, why, where and when questions of your team's model production directly from the information in your Git repository.</p> <p>Refer to the <a href="https://dvc.org/doc/studio/user-guide/model-registry" target="_blank" rel="nofollow noopener noreferrer">documentation and tutorials</a> to get started. To request support or share feedback, you can <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">email me</a> or create a support ticket on <a href="https://github.com/iterative/studio-support" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DYeVI-QrHGI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>https://dvc.org/blog/serving-models-with-mlemhttps://dvc.org/blog/serving-models-with-mlemTue, 19 Jul 2022 00:00:00 GMT<p>Training a machine learning model is only one step in the process of getting something useful out to end-users. When it's time to deploy the model to production, there are a number of approaches you can take depending on the goal of the machine learning project. That might mean getting the model ready to respond to real-time queries coming from an API or batch processing predictions, for example.</p> <p>Either way, you'll need to save your trained and validated model in a format that's consumable by other systems. 
That's why we'll be covering how to serve models through a <a href="https://restfulapi.net/" target="_blank" rel="nofollow noopener noreferrer">REST</a> endpoint or a Python package with <a href="https://mlem.ai/" target="_blank" rel="nofollow noopener noreferrer">MLEM</a>.</p> <blockquote> <p>You can get the repo we're working with <a href="https://github.com/iterative/stale-model-example/tree/mlem-serve" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> </blockquote> <h2 id="take-a-candidate-model" style="position:relative;">Take a candidate model<a href="#take-a-candidate-model" aria-label="take a candidate model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are instructions in the project <a href="https://github.com/iterative/stale-model-example/tree/mlem-serve#readme" target="_blank" rel="nofollow noopener noreferrer">README</a> on how to get everything you need installed and running. This is a simple ML project that uses <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> for data versioning and experiment tracking.</p> <p>After you have the repo set up, you'll already have the <code>mlem</code> package installed. 
This project already has a model that's been trained and validated so we can move on to saving this model.</p> <h2 id="save-the-model" style="position:relative;">Save the model<a href="#save-the-model" aria-label="save the model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Inside the <code>train.py</code> script, we need to add the <code>mlem</code> import to save the models as we experiment. We don't have to worry about running the training script for this project since we have the model, but it's good to know what's happening under the hood.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># train.py</span> <span class="token keyword">import</span> os <span class="token keyword">import</span> pickle5 <span class="token keyword">as</span> pickle <span class="token keyword">import</span> sys <span class="token keyword">import</span> yaml <span class="token keyword">from</span> mlem<span class="token punctuation">.</span>api <span class="token keyword">import</span> save <span class="token keyword">import</span> numpy <span class="token keyword">as</span> np <span class="token keyword">from</span> sklearn<span class="token punctuation">.</span>ensemble <span class="token keyword">import</span> RandomForestClassifier <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div> <p>Then you can add the <code>save</code> function to the end of the training script.</p> <div 
class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># train.py</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> clf <span class="token operator">=</span> RandomForestClassifier<span class="token punctuation">(</span> n_estimators<span class="token operator">=</span>n_est<span class="token punctuation">,</span> min_samples_split<span class="token operator">=</span>min_split<span class="token punctuation">,</span> n_jobs<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">,</span> random_state<span class="token operator">=</span>seed <span class="token punctuation">)</span> clf<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>x<span class="token punctuation">,</span> labels<span class="token punctuation">)</span> save<span class="token punctuation">(</span> clf<span class="token punctuation">,</span> <span class="token string">"clf"</span><span class="token punctuation">,</span> sample_data<span class="token operator">=</span>x<span class="token punctuation">,</span> description<span class="token operator">=</span><span class="token string">"Random Forest Classifier"</span><span class="token punctuation">,</span> <span class="token punctuation">)</span></code></pre></div> <admon type="tip"> <p>You don't have to do these steps as we already have a model available, but if you want to see the training and evaluation steps in action, you can reproduce the DVC pipeline with:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div> <p>You can check out what is happening in that pipeline by looking at the <a 
href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file in the project.</p> <p>You can also see where we load the model into the <code>src/evaluate.py</code> script. To do that, you'll need to add the following import.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># evaluate.py</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> <span class="token keyword">import</span> pickle5 <span class="token keyword">as</span> pickle <span class="token keyword">import</span> sklearn<span class="token punctuation">.</span>metrics <span class="token keyword">as</span> metrics <span class="token keyword">from</span> mlem<span class="token punctuation">.</span>api <span class="token keyword">import</span> <span class="token builtin">apply</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div> <p>Now we can use the <a href="https://mlem.ai/doc/api-reference/apply" target="_blank" rel="nofollow noopener noreferrer"><code>apply</code> function</a> to make predictions with the model.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># evaluate.py</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> x <span class="token operator">=</span> matrix<span class="token punctuation">.</span>iloc<span class="token punctuation">[</span><span class="token punctuation">:</span><span class="token punctuation">,</span><span class="token number">1</span><span class="token punctuation">:</span><span class="token number">11</span><span class="token punctuation">]</span><span class="token punctuation">.</span>values cleaned_x <span 
class="token operator">=</span> np<span class="token punctuation">.</span>where<span class="token punctuation">(</span>np<span class="token punctuation">.</span>isnan<span class="token punctuation">(</span>x<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">0</span><span class="token punctuation">,</span> x<span class="token punctuation">)</span> labels_pred <span class="token operator">=</span> <span class="token builtin">apply</span><span class="token punctuation">(</span>model_file<span class="token punctuation">,</span> cleaned_x<span class="token punctuation">,</span> method<span class="token operator">=</span><span class="token string">"predict"</span><span class="token punctuation">)</span> predictions_by_class <span class="token operator">=</span> <span class="token builtin">apply</span><span class="token punctuation">(</span>model_file<span class="token punctuation">,</span> cleaned_x<span class="token punctuation">,</span> method<span class="token operator">=</span><span class="token string">"predict_proba"</span><span class="token punctuation">)</span> predictions <span class="token operator">=</span> predictions_by_class<span class="token punctuation">[</span><span class="token punctuation">:</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">]</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div> <p>The <code>predict</code> and <code>predict_proba</code> are methods available from the model and they are used to get new predicted values and their probabilities for evaluation. This, along with everything else in the script, is how we get the metrics for each experiment.</p> </admon> <p>After you run an experiment, there will be two new files in your repo: <code>clf</code> and <code>clf.mlem</code>. 
Make sure you add the <code>clf.mlem</code> file to your Git history with the following command:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> clf.mlem</span></code></pre></div> <p>This is so that the metadata is in your repo and ready to use with other MLEM commands. Now we can finally take the model file and ship it to production!</p> <h2 id="deploy-the-model-to-production" style="position:relative;">Deploy the model to production<a href="#deploy-the-model-to-production" aria-label="deploy the model to production permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are a couple of ways you can do this with MLEM:</p> <ul> <li>Serve the model with <a href="https://fastapi.tiangolo.com/" target="_blank" rel="nofollow noopener noreferrer">FastAPI</a>.</li> <li>Create a Python package (and use or distribute it).</li> </ul> <p><em>Note:</em> There is also an option to <a href="https://mlem.ai/doc/get-started/deploying" target="_blank" rel="nofollow noopener noreferrer">deploy the model directly to Heroku</a>, although this functionality is experimental and may have breaking changes.</p> <h3 id="serve-with-fastapi" style="position:relative;">Serve with FastAPI<a href="#serve-with-fastapi" aria-label="serve with fastapi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 
1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you don't have an API to work with and don't need a Python package, for example if you're just testing a model, you can serve your model quickly using FastAPI with this command.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem serve</span> clf fastapi</span></code></pre></div> <p>This will run a local server and spin up a web API for you so you can quickly test out your model without needing a development team to work on the API initially.</p> <p>You'll see an output like this in your terminal:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem serve</span> clf fastapi </span>⏳️ Loading model from clf.mlem Starting fastapi server... 💅 Adding route for /predict 💅 Adding route for /predict_proba 💅 Adding route for /sklearn_predict 💅 Adding route for /sklearn_predict_proba Checkout openapi docs at &lt;http://0.0.0.0:8080/docs&gt; INFO: Started server process [31916] INFO: Waiting for application startup. INFO: Application startup complete. 
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)</code></pre></div> <p>Then, when you go to the local URL, you'll see the <a href="https://fastapi.tiangolo.com/features/#automatic-docs" target="_blank" rel="nofollow noopener noreferrer">documentation</a> for how to use the model you created.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7289114fcbec3359f45be08e125104f2/39600/fastapi_docs.png" alt="FastAPI ML model deployment" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>That's it! Now you know how to train a model, save it, and deploy to some external service quickly using MLEM!</p> <h3 id="custom-python-package" style="position:relative;">Custom Python package<a href="#custom-python-package" aria-label="custom python package permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Let's take a look at making a Python package and importing it into a <a href="https://flask.palletsprojects.com/en/2.1.x/" target="_blank" rel="nofollow noopener noreferrer">Flask</a> web app. 
To make the Python package, we'll run the following MLEM command.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token mlem">mlem build</span> clf pip <span class="token parameter variable">-c</span> <span class="token assign-left variable">target</span><span class="token operator">=</span>build/ <span class="token parameter variable">-c</span> <span class="token assign-left variable">package_name</span><span class="token operator">=</span>bike_predictor</span></code></pre></div> <p>This takes our <code>clf.mlem</code> file and generates a Python package called <code>bike_predictor</code> in the <code>build</code> directory. When you look in your project, you should see that new <code>build</code> folder that has all of the files you need for an independent Python package.</p> <p>To build the package, you'll need to run the following command in the <code>build</code> directory.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> <span class="token parameter variable">-m</span> build <span class="token parameter variable">--wheel</span></span></code></pre></div> <p>Then go back to the top-level directory and run the following command to install your new model package.</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> ./build/dist/bike_predictor-0.0.0-py3-none-any.whl</span></code></pre></div> <p>Now you can import this to your Flask API like so.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token 
comment"># api.py</span> <span class="token keyword">import</span> os <span class="token keyword">from</span> flask <span class="token keyword">import</span> Flask<span class="token punctuation">,</span> jsonify<span class="token punctuation">,</span> request <span class="token keyword">from</span> flask_sqlalchemy <span class="token keyword">import</span> SQLAlchemy <span class="token keyword">from</span> flask_migrate <span class="token keyword">import</span> Migrate <span class="token keyword">from</span> flask_cors <span class="token keyword">import</span> CORS <span class="token keyword">from</span> dotenv <span class="token keyword">import</span> load_dotenv <span class="token keyword">import</span> bike_predictor <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div> <p>You can then use the <code>predict</code> method on new data and run any other tasks you need to in the API.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token comment"># api.py</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> new_event <span class="token operator">=</span> EventsModel<span class="token punctuation">(</span> title<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> date<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"date"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> time<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"time"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> location<span 
class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"location"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> description<span class="token operator">=</span>data<span class="token punctuation">[</span><span class="token string">"description"</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token punctuation">)</span> db<span class="token punctuation">.</span>session<span class="token punctuation">.</span>add<span class="token punctuation">(</span>new_event<span class="token punctuation">)</span> db<span class="token punctuation">.</span>session<span class="token punctuation">.</span>commit<span class="token punctuation">(</span><span class="token punctuation">)</span> bike_predictor<span class="token punctuation">.</span>predict<span class="token punctuation">(</span>new_event<span class="token punctuation">)</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span></code></pre></div> <p>Then you can test this API out locally by running the following command:</p> <div class="gatsby-highlight" data-language="cli"><pre class="language-cli"><code class="language-cli"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> src/api.py</span></code></pre></div> <p>This will start up a local server on port 5000 and you'll be able to see your model in action. 
From here, this can be deployed to any cloud environment as long as you remember to include and install the model package.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In this post, we learned how easy it can be to deploy a model through FastAPI or through a Python package with MLEM. You can use this same process to train and serve any model through an API endpoint very quickly. This can help with validation and collaboration with team members, and it can help you see if there are any underlying issues in your overall deployment process before you hear about them from users. MLEM can also be used to create a model registry so you can store and switch between models whenever you need to.</p>https://dvc.org/blog/july-heartbeathttps://dvc.org/blog/july-heartbeatMon, 18 Jul 2022 00:00:00 GMT<details> <p>This month our cover image is inspired by a Community member <a href="https://twitter.com/GiftOjeabulu_" target="_blank" rel="nofollow noopener noreferrer">Gift Ojeabulu</a>. Gift is a champion of Community and is a leader in the data movement in Nigeria. He recently gave a presentation on DVC at the <a href="https://twitter.com/GiftOjeabulu_" target="_blank" rel="nofollow noopener noreferrer">Open Source Africa Conference</a>. He is also deeply involved in building the data Community in Africa through <a href="https://datafestafrica.com/" target="_blank" rel="nofollow noopener noreferrer">Data Fest Africa</a>. 
We are lucky to have a Gift as a member of our own Community.</p> <summary>✨Image Inspo✨</summary> </details> <h1 id="first-an-apology" style="position:relative;">First an apology<a href="#first-an-apology" aria-label="first an apology permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>I first must share my sincere apologies. With all that was going on in the Iterative Community last month, I ran out of time to finish the June Heartbeat. With even more time passing there's lots to write about; let's do this!</p> <p><img src="https://media.giphy.com/media/CzbiCJTYOzHTW/giphy.gif" alt="Send Tom Hanks GIF"></p> <h2 id="mlem-release" style="position:relative;">MLEM Release<a href="#mlem-release" aria-label="mlem release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>On June 1st we released our latest open source tool in the Iterative ecosystem. 
MLEM is a model registry and deployment tool connected to your Git repo.<br> Together with <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> and <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> (Git Tag Ops), MLEM helps you maintain a model registry right in your Git repository. Now we have one more step in the process of fully syncing together the software development and machine learning worlds. To learn more about MLEM, visit <a href="https://mlem.ai" target="_blank" rel="nofollow noopener noreferrer">the website</a>, <a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">⭐️ the repository</a>, <a href="https://dvc.org/blog/DVC-VS-Code-extension" target="_blank" rel="nofollow noopener noreferrer">read the blog post</a>, or <a href="https://youtu.be/a2Lc9kEgEM8" target="_blank" rel="nofollow noopener noreferrer">watch the video</a> of <a href="https://github.com/mike0sv" target="_blank" rel="nofollow noopener noreferrer"><strong>Mike Sveshnikov's</strong></a> full presentation and demo on MLEM at our Release Party.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/a2Lc9kEgEM8?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>If you're pressed for time, you can also catch a shorter version of the presentation with <a href="https://www.linkedin.com/in/agrigorev/" target="_blank" rel="nofollow noopener noreferrer">Alexey 
Grigorev</a> of <a href="https://datatalks.club/" target="_blank" rel="nofollow noopener noreferrer">Data Talks Club</a> <a href="https://www.youtube.com/watch?v=QQZUy0kSzOk" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h2 id="mlops-world-2022" style="position:relative;">MLOps World 2022<a href="#mlops-world-2022" aria-label="mlops world 2022 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>I started writing this Heartbeat on the plane heading back from <a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> in Toronto. This conference was a real treat! It was wonderful to meet so many Community members already using DVC and also to see conference talks advocating for our tools that we didn't even know were going to happen! 
Many thanks to <a href="https://www.interos.ai/" target="_blank" rel="nofollow noopener noreferrer">Interos'</a> <a href="https://www.linkedin.com/in/stephanrb3/" target="_blank" rel="nofollow noopener noreferrer"><strong>Stephen Brown</strong></a> and <a href="https://www.linkedin.com/in/amybachir/" target="_blank" rel="nofollow noopener noreferrer"><strong>Amy Bachir</strong></a> for sharing about DVC and CML in the talk, <em>A GitOps Approach to Machine Learning.</em></p> <p>Additionally, it was great to finally meet in person all the people from the greater MLOps Community that I'd previously only known virtually including <a href="https://www.linkedin.com/in/dpbrinkm/" target="_blank" rel="nofollow noopener noreferrer"><strong>Demetrios Brinkman</strong></a> of <a href="https://mlops.community/" target="_blank" rel="nofollow noopener noreferrer">MLOps Community Slack</a>, our friends from <a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a>, and <a href="https://tryolabs.com/" target="_blank" rel="nofollow noopener noreferrer">Tryo-Labs</a>, and one of our Community Champions <a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sami Jawhar</strong></a> who presented at one of our most engaging meetups on record, asking the question <em>What IS an experiment?</em> You can find this great talk below.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DxZdWq3Weng?rel=0&%3B=&%3Bshowinfo=0%3B&start=1309" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a 
href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>The conference talks were great. I was able to attend three:</p> <ul> <li><em>Top 5 Lessons Learned in Helping Organizations Adopt MLOps practices</em> from <a href="https://www.linkedin.com/in/shelbee-eigenbrode/" target="_blank" rel="nofollow noopener noreferrer"><strong>Shelbee Eigenbrode</strong></a>, Principal AI/ML Specialist</li> <li><em>Panel: What Every Product Manager Delivering AI Solutions Should Know</em>, moderated by <a href="https://www.linkedin.com/in/jessie-lamontagne-89b2a912b/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jessie Lamontagne</strong></a> (who was lucky enough to take home her very own DeeVee, see below), Data Science Manager at Kinaxis; with <a href="https://www.linkedin.com/in/nahlags/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nahla Salem</strong></a>, Senior Product Manager at <a href="https://www.yelp.com/" target="_blank" rel="nofollow noopener noreferrer">Yelp</a>; <a href="https://www.linkedin.com/in/anneya-golob/" target="_blank" rel="nofollow noopener noreferrer"><strong>Anneya Golob</strong></a>, Staff Data Scientist at <a href="https://www.shopify.com/" target="_blank" rel="nofollow noopener noreferrer">Shopify</a>; and <a href="https://www.linkedin.com/in/phillipgornicki/" target="_blank" rel="nofollow noopener noreferrer"><strong>Phillip Gorniki</strong></a>, Sr. Product Manager at <a href="https://www.kinaxis.com/en" target="_blank" rel="nofollow noopener noreferrer">Kinaxis</a>. One quote from Nahla that stood out for me from this panel was, "If everything is a priority, nothing is a priority." That was a lesson I needed to take to heart, hence a bumped Heartbeat. 
😢</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f2fe15e9b439f17b5ad8ac2d6aa406c9/39600/jessie-lamontagne.png" alt="Jessie Lamontagne" title="Jessie Lamontagne" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Jessie Lamontagne of Kinaxis with DeeVee! (<a href="https://www.linkedin.com/in/jessie-lamontagne-89b2a912b/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>I heard great feedback from attendees on conference talks as well. In general, the atmosphere at the conference had a fantastic, positive vibe with great connections made through the event app, the conference itself, and parties and networking opportunities 🥳🍻 We also thoroughly enjoyed being Expo Booth neighbors with <a href="https://www.seldon.io/" target="_blank" rel="nofollow noopener noreferrer">Seldon</a> (model serving) and <a href="https://www.genesiscloud.com/" target="_blank" rel="nofollow noopener noreferrer">Genesis Cloud</a> (environmentally sustainable GPUs!) I must finally give hats off to the organizers <a href="https://www.linkedin.com/in/farazthambi/" target="_blank" rel="nofollow noopener noreferrer"><strong>Faraz Thambi</strong></a> and <a href="https://www.linkedin.com/in/tinaaprile/" target="_blank" rel="nofollow noopener noreferrer"><strong>Tina Aprile</strong></a>, who delivered an extremely well thought out and run, in-person Conference! If you didn't attend this year, you should definitely put it on your radar for next, or attend their <a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a> in November! Plus Toronto was fun! 
Check out our team dinner on the last night, at the CN Tower.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/89437ee946b86dc731eb573ef5ca24e3/03346/team-toronto.jpg" alt="Team Dinner at the CN Tower" title="Team Dinner at the CN Tower" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Team dinner at the CN Tower (pictured L to R: Gabriella Caraballo, Stephanie Roy, Mike Moynihan, Jorge Orpinel Perez (forward), me, Mikhail Sveshnikov, Milecia McGregor (forward), Max Aginsky, Alex Kim (forward), and Dmitry Petrov)</em></p> <h2 id="dvc-extension-for-vs-code" style="position:relative;">DVC Extension for VS Code<a href="#dvc-extension-for-vs-code" aria-label="dvc extension for vs code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We just released our DVC extension for VS Code! It was so fun to let the cat out of the bag to conference goers and watch their eyes light up! 😃 This foreshadowed what was to come at the release! 
It hadn't been a complete secret: <a href="https://twitter.com/DynamicWebPaige/status/1430920240251035649" target="_blank" rel="nofollow noopener noreferrer">Paige Bailey's tweet</a> mentioned it a while ago, and it's been on the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">VS Code Marketplace</a> for a couple of months so beta testers could try it out. But we finally, officially released the tool on June 12th.</p> <p>And OH. MY. GOSH. The response has been amazing! Over 3,400 people have already watched the video below on YouTube. And 1,000 more new users downloaded the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC Extension for VS Code</a> from the marketplace, just within the first couple of days!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/LHi3SWGD9nc?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>You will find in this extension:</p> <ul> <li>tons of experiment tracking and table functionality beyond the regular CLI</li> <li>live metrics tracking</li> <li>the ability to run and queue experiments directly from the experiment table or the command tree</li> <li>sorting, drag and drop column and group movement</li> <li>expanded plot viewing capabilities - zoom into plots and save them as PNGs or SVGs for your reporting needs</li> </ul> <p>If you are a 
DVC and VS Code user, you will be a happy camper! Please try it and as always reach out with feedback! We want to make these tools better for you!</p> <p>Since the release, <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> talked with <a href="https://twitter.com/ReynaldAdolphe" target="_blank" rel="nofollow noopener noreferrer"><strong>Reynold Adolphe</strong></a> on the VS Code Livestream and showed off the tool. You can check that out here! 👇🏽</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/Eq3100S3aHw?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="content-from-the-community" style="position:relative;">Content from the Community<a href="#content-from-the-community" aria-label="content from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There's been lots of juicy content from the Community <a href="https://dvc.org/blog/may-22-heartbeat" target="_blank" rel="nofollow noopener noreferrer">since the last Heartbeat</a>. 
When I first started at Iterative over a year and a half ago, I would hope each month that there would be enough content from the Community to write about. This is no longer an issue; I sadly have to filter it now, so that these Heartbeats don't go on for days. If you've written something about our tools and it hasn't appeared in a Heartbeat, just know that we see it and we are grateful for all the Community's efforts to share about our tools! 🙏🏼</p> <h3 id="alex-strick-van-linschoten---more-data-more-problems-using-dvc-to-handle-data-versioning-for-a-computer-vision-problem" style="position:relative;"><strong>Alex Strick van Linschoten</strong> - More Data, More Problems: Using DVC to handle data versioning for a computer vision problem<a href="#alex-strick-van-linschoten---more-data-more-problems-using-dvc-to-handle-data-versioning-for-a-computer-vision-problem" aria-label="alex strick van linschoten more data more problems using dvc to handle data versioning for a computer vision problem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/strickvl/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Strick van Linschoten</strong></a> brings us <a href="https://mlops.systems/tools/redactionmodel/computervision/mlops/2022/05/24/data-versioning-dvc.html#-appendix-how-to-switch-from-git-lfs-to-dvc" target="_blank" rel="nofollow noopener noreferrer">this great overview of DVC's versioning capabilities</a> on his use of DVC in a redaction identifier project. 
He goes through the pluses of using DVC which he mentions as "be(ing) more or less unchallenged for what it does in the data versioning domain." He had previously used Git LFS and found it to be less robust so made the switch to DVC. In his post, he provides a <a href="https://mlops.systems/tools/redactionmodel/computervision/mlops/2022/05/24/data-versioning-dvc.html#-appendix-how-to-switch-from-git-lfs-to-dvc:~:text=I%E2%80%99m%20missing%20out%E2%80%A6-,%F0%9F%8F%83%20Appendix%3A%20How%20to%20switch%20from%20git%2Dlfs%20to%20DVC,-When%20I%20first" target="_blank" rel="nofollow noopener noreferrer">tutorial on making the switch from Git LFS to DVC</a>. We are so grateful to Alex for sharing this guide with the Community!</p> <p>Also super worthy of mention is Alex's shout-out about our welcoming Community. We are thankful for this praise and for his contributions to our Community. 🙏🏼</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c213851c35adb7755dbf690baca5f00d/39600/alex-strick-van-linshoten.png" alt="Iterative Community shout out from Alex Strick van Linshoten" title="Iterative Community shout out from Alex Strick van Linshoten" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Thanks for the shout-out Alex! 
(<a href="https://mlops.systems/tools/redactionmodel/computervision/mlops/2022/05/24/data-versioning-dvc.html#-appendix-how-to-switch-from-git-lfs-to-dvc" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="mymlops-stack" style="position:relative;">MyMLOps Stack<a href="#mymlops-stack" aria-label="mymlops stack permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://mymlops.com/" target="_blank" rel="nofollow noopener noreferrer">MyMLOps.com</a> provides a tool to help you build a cool diagram for your MLOps Stack. There's no about page there or indication of who made this for the greater MLOps Community, which is frankly a bit sus. Nevertheless, we were excited to see DVC included in the section of Experiment Tracking as it should! We know there are other great experiment tracking tools out there, and we are content to see that the larger Community is starting to recognize this capability with DVC! We like to think of it as taking a step beyond tracking to versioning. 
To learn more about experiment versioning, <a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">visit this blog piece</a> from Technical Product Manager - DVC, <a href="https://www.linkedin.com/in/david-berenbaum-20b6b424/" target="_blank" rel="nofollow noopener noreferrer">Dave Berenbaum</a>.</p> <p>Our team had an internal discussion about the absence of our tools from certain categories, DVC and CML for artifact tracking, CML for Pipeline Orchestration Runtime Engine, MLEM for Model Registry and Serving. But like everything in this space, things are changing constantly. Thanks to whoever you are out there that made this nifty tool!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/594a223e585365459e0234eed352f138/39600/mymlops.png" alt="MyMLOps.com" title="MyMLOps.com" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>MLOps tool stack diagram generator from MyMLOps.com (<a href="https://mymlops.com/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="samson-zhang-mlops-how-dvc-smartly-manages-your-data-sets-for-training-your-machine-learning-models-on-top-of-git" style="position:relative;"><strong>Samson Zhang</strong>: MLOps: How DVC smartly manages your data sets for training your machine learning models on top of Git<a href="#samson-zhang-mlops-how-dvc-smartly-manages-your-data-sets-for-training-your-machine-learning-models-on-top-of-git" aria-label="samson zhang mlops how dvc smartly manages your data sets for training your machine learning models on top of git permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/samson-zhang-887135115/" target="_blank" rel="nofollow noopener noreferrer"><strong>Samson Zhang</strong></a> of <a href="www.littlebigcode.fr">LittleBigCode</a> writes an in-depth article on <a href="https://medium.com" target="_blank" rel="nofollow noopener noreferrer">Medium</a> on how DVC aptly manages large datasets. He discusses why DVC is needed and why he finds it a better option than MLflow, which does not optimize storage for duplicated files the way DVC does, and than Git LFS, for the same reasons Alex Strick van Linschoten gives in the piece mentioned above. Samson goes through a very thorough overview of the tool, how it works, and how to use it. He includes some best practices that he has figured out while using the tool and goes over how to set up a dataset registry, which he finds particularly useful with DVC.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4f841c2fa3b8b4802ca2368d9697b1df/39600/samson-zhang.png" alt="Samson Zhang, DVC Workflow, Cache and Storage" title="Samson Zhang, DVC Workflow, Cache and Storage" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC workflow, cache, and storage (<a href="https://medium.com/hub-by-littlebigcode/mlops-how-dvc-smartly-manages-your-data-sets-for-training-your-machine-learning-models-on-top-of-b73857e54e52" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="dror-atariah-getting-to-know-mlem" style="position:relative;"><strong>Dror Atariah</strong>: Getting to Know MLEM<a 
href="#dror-atariah-getting-to-know-mlem" aria-label="dror atariah getting to know mlem permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 100px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/299a2342f2c5966eba1ce37590234270/39600/awesome.png" alt="Awesome MLEM" title="Getting to Know MLEM" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="https://www.linkedin.com/in/atariah/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dror Atariah</strong></a> is the first Community member to write about MLEM! 🎉 In <a href="http://drorata.github.io/posts/2022/Jun/17/getting-to-know-mlem/" target="_blank" rel="nofollow noopener noreferrer">his piece</a> he gives a review of the tool and starts with a general overview. Giving it a try with the iris dataset, he ultimately builds a Docker image with MLEM to get predictions from a trained model served by MLEM in an API. 
You can try out his project <a href="https://github.com/drorata/mlem-review" target="_blank" rel="nofollow noopener noreferrer">in this repo!</a></p> <h3 id="-new-docs" style="position:relative;">✍🏼 New Docs<a href="#-new-docs" aria-label=" new docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>As you can imagine, with new tools come new docs! The docs and product teams have been furiously busy making sure that you have the docs you need to try our new tools. Of note please find:</p> <ul> <li><a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">MLEM Docs</a></li> <li><a href="https://mlem.ai/doc/use-cases/model-registry" target="_blank" rel="nofollow noopener noreferrer">Machine Learning Model Registry</a> in <a href="https://dvc.org/doc/use-cases/model-registry" target="_blank" rel="nofollow noopener noreferrer">DVC.org docs</a> as well as in the <a href="https://mlem.ai/doc/use-cases/model-registry" target="_blank" rel="nofollow noopener noreferrer">MLEM docs</a></li> <li><a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">VS Code docs and walkthrough</a></li> </ul> <h2 id="-tons-of-new-content-on-the-blog" style="position:relative;">✍🏼 Tons of new content on the blog<a href="#-tons-of-new-content-on-the-blog" aria-label=" tons of new content on the blog permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi-docker" target="_blank" rel="nofollow noopener noreferrer">Moving Local Experiments to the Cloud with Terraform Provider Iterative (TPI) and Docker</a></li> </ul> <p>Have you ever struggled, or are you struggling now, with syncing data with one of the cloud providers? We know that comes up a lot in the Discord server. So <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer">Milecia McGregor</a> wrote three detailed pieces to help you out.</p> <ul> <li><a href="https://dvc.org/blog/aws-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Syncing Data to AWS S3</a></li> <li><a href="https://dvc.org/blog/using-gcp-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Syncing Data to GCP</a></li> <li><a href="https://dvc.org/blog/azure-remotes-in-dvc" target="_blank" rel="nofollow noopener noreferrer">Syncing Data to Azure Blob Storage</a><br> Whatever your flavor, she's got you covered. Look out for short videos covering the same topics this quarter.</li> </ul> <p>Find more of your Discord questions answered in the latest editions of Community Gems. 
💎</p> <ul> <li><a href="https://dvc.org/blog/may-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">May Community Gems</a></li> <li><a href="https://dvc.org/blog/june-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">June Community Gems</a></li> </ul> <h2 id="-online-course-updates" style="position:relative;">🧑🏽‍💻 Online Course Updates<a href="#-online-course-updates" aria-label=" online course updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have surpassed 1,300 students in our <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">Iterative Tools School!</a> 🎉 We now have in place:</p> <ul> <li>Closed captions</li> <li>Course guides for each lesson. For some of these, you will find the video embedded into the lesson itself, but for the lessons that include code snippets, the guides are in PDF form so that you can copy and paste them to your heart's content! 😉</li> </ul> <p>If you are already in the course, or follow us on social media, you may have noticed the amazing notes that <a href="https://twitter.com/SoyGema" target="_blank" rel="nofollow noopener noreferrer">Gema Perreño Piqueras</a> has created on the modules (see below). 🚨Spoiler alert: Gema's joining the DevRel team next week! 
So look forward to more great content from her.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2f76b185a9df69f8115c4f9765d63833/03346/gema-course-notes.jpg" alt="Gema Perreño Piqueras' Course Notes" title="Gema Perreño Piqueras' Course Notes" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Gema Perreño Piqueras' Course Notes (<a href="https://twitter.com/SoyGema/status/1543210842749079554?s=20&t=DMCw3cN8rFbwlD1hD_rotA" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We'll be at <a href="https://ai4.io/" target="_blank" rel="nofollow noopener noreferrer">AI4</a> from August 16-18.<br> <a href="https://twitter.com/fullstackml?lang=en" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> will give a talk as well as participate in a panel discussion on MLOps. 
If you are attending, stop by the booth and say hi or check out one of the in-booth demos we will have on our tools throughout the day.</p> <p>Additional conferences we will be attending this year:</p> <ul> <li><a href="https://odsc.com/california/" target="_blank" rel="nofollow noopener noreferrer">ODSC West</a> - San Francisco</li> <li><a href="https://deeplearningworld.de/" target="_blank" rel="nofollow noopener noreferrer">Deep Learning World</a> - Berlin</li> <li><a href="https://www.re-work.co/events/mlops-summit-2022" target="_blank" rel="nofollow noopener noreferrer">MLOps Summit - Re-work</a> - London</li> <li><a href="https://www.torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Summit</a> - Toronto</li> </ul> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the open positions. This month we are especially seeking a fit for the Senior Software Engineer (Dataset Label Management, Python) role, so if that fits you or someone else you know, get applying! 
🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a25646ea1418d7a63d5bc4e68079fba9/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative is Hiring (<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Because I missed a month, there's just going to have to be two…</p> <p>We were excited to see this project come up from <a href="https://twitter.com/algo_diver" target="_blank" rel="nofollow noopener noreferrer">Chansung</a> using DVC, Iterative Studio, Huggingface and Jarvis Labs AI.<br> Looking forward to seeing how it develops! 🍿</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Redrew for easier understanding of <a href="https://twitter.com/hashtag/git_mlops?src=hash&ref_src=twsrc%5Etfw">#git_mlops</a> projects with <a href="https://twitter.com/DVCorg">@DVCorg</a> and <a href="https://twitter.com/jarvislabsai">@jarvislabsai</a>. 
The code needs to be cleaned, but it now deploys any model from any branches to <a href="https://twitter.com/huggingface">@huggingface</a> model and space repository.<br><br>Basically <a href="https://twitter.com/DVCorg">@DVCorg</a> is heavily used, but I just put the one logo in it. <a href="https://t.co/Cj7Z7KPaOy">pic.twitter.com/Cj7Z7KPaOy</a></p>— chansung (@algo_diver) <a href="https://twitter.com/algo_diver/status/1530455733837647873">May 28, 2022</a></blockquote> <p>And we have this great Tweet thread from <a href="https://twitter.com/LeonMenkreo" target="_blank" rel="nofollow noopener noreferrer">Leon Menkreo</a> about how he's taken back control of his data, models, and predictions with DVC!</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I took back control of my data, models, and predictions with<br><br>Data Versioning 🔀<br><br>Everything you need to get started with DVC by <a href="https://twitter.com/DVCorg">@DVCorg</a> in one mega 🧵:<br> <br>⁉️ What is DVC?<br>🔀 DVC & Model Versioning<br>🐍 DVC in python<br>📚 Resources</p>— Leon Menkreo (@LeonMenkreo) <a href="https://twitter.com/LeonMenkreo/status/1545410381677531136">July 8, 2022</a></blockquote> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/using-gcp-remotes-in-dvchttps://dvc.org/blog/using-gcp-remotes-in-dvcWed, 06 Jul 2022 00:00:00 GMT<p>When you’re working on a data science project that has huge datasets, it’s common to store them in cloud storage. You’ll also be working with different versions of the same datasets to train a model, so it’s crucial to have a tool that enables you to do this quickly and easily. That’s why we’re going to do a quick walkthrough of how to set up a remote in a GCP storage bucket and handle data versioning with <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>.</p> <p>We’ll start by creating a new storage bucket in our GCP account, then we’ll show how you can add DVC to your project, and finally, we’ll make updates to the dataset with DVC commands. We’ll be working with <a href="https://github.com/iterative/stale-model-example" target="_blank" rel="nofollow noopener noreferrer">this repo</a> if you want an example to play with. 
By the time you finish, you should be able to create this setup for any machine learning project using a GCP remote.</p> <h2 id="set-up-a-gcp-storage-bucket" style="position:relative;">Set up a GCP storage bucket<a href="#set-up-a-gcp-storage-bucket" aria-label="set up a gcp storage bucket permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Make sure that you already have a <a href="https://console.cloud.google.com" target="_blank" rel="nofollow noopener noreferrer">GCP account</a>. You’ll need a valid credit card to create a new account. Once you’re logged into your account, you should see a screen like this with some of the services GCP offers.</p> <p><em>Note:</em> Remember, GCP does have a <a href="https://cloud.google.com/free/docs/gcp-free-tier" target="_blank" rel="nofollow noopener noreferrer">free tier</a> if you just want to try it out.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f9291575b358c6d1e93d3902bd8f8df6/39600/gcp_initial_page.png" alt="GCP initial page" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>From here, you'll need to create a new project. Search for "create a project" and click the "IAM & Admin" option. You'll enter the name of the project, which is <code>Bicycle Project</code>, and choose the organization and location and click the <code>Create</code> button. 
This will take you to your project dashboard and show you all of the stats and settings you have available.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/31c5ad53787a867d969e87e822a0832d/39600/gcp_new_project.png" alt="create a new GCP project" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Then you need to go to <code>Cloud Storage</code> in the left sidebar to create a bucket to store the data. When you get to the Cloud Storage page, you should see something similar to this and you’ll click the <code>Create Bucket</code> button.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/990bd68c194aa8588a25cb16e7cee4ac/39600/create_gcp_bucket.png" alt="create_gcp_bucket.png" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>The Bucket page will have a lot of configurations you can set, but you can leave the settings in the default state if there’s nothing you need to customize. 
We have named this example bucket <code>updatedbikedata</code> as you can see below.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/95a0fcf3369dc54bc0ae23f1d28eb937/39600/gcp_bucket_options.png" alt="gcp_bucket_options.png" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Now you can save your changes and you’ll be redirected to the <code>Bucket Details</code> page and you’ll see the bucket you just created.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/594ddcf13327666ccddcca1d3f0ec75a/39600/created_gcp_bucket.png" alt="created_gcp_bucket.png" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="get-your-credentials" style="position:relative;">Get your credentials<a href="#get-your-credentials" aria-label="get your credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Since you have the bucket created, we need to get the credentials to connect the GCP remote to the project. 
Go to the <code>IAM & Admin</code> service and go to <code>Service Accounts</code> in the left sidebar.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/696a2d32a8294f63e608c3a2823ef65d/39600/gcp_empty_service_account.png" alt="no service accounts" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Click the <code>Create Service Account</code> button to create a new service account that you'll use to connect to the DVC project in a bit. Now you can add the name and ID for this service account and keep all the default settings. We've chosen <code>bicycle-service-account</code> for the name and <code>bicycle-account</code> for the ID. Click <code>Create and Continue</code> and it will show the permissions settings. Select <code>Owner</code> in the dropdown and click <code>Continue</code>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ffa1f3370523aa3068032c6db3e7782f/39600/gcp_service_account_permissions.png" alt="service account permissions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Then add your user to have access to the service account and click <code>Done</code>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/867a743f3bf910458d9c473f62a9906d/39600/gcp_service_account_user_access.png" alt="service account user access" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Finally, you'll be redirected to the <code>Service accounts</code> 
page.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/acd81a329b5b3c20d943c378603d9ea4/39600/gcp_create_service_account.png" alt="service account with name and ID" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>You’ll see your service account. Click <code>Actions</code> and choose <code>Manage keys</code> for this service account.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/989cff34794f4244822a454c66b7dbd0/39600/gcp_service_account.png" alt="manage keys on service account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Once you’ve been redirected, click the <code>Add Key</code> button to bring up the credentials you need to authenticate your GCP account with your project. Go ahead and download the credentials file and store it somewhere safe.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9625bb956a5e8bd2c69f71d87df4a863/39600/gcp_key.png" alt="gcp_key.png" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>That’s it for setting up your storage bucket and getting the credentials you need! 
Now let’s add DVC to our demo repo and set up the remote.</p> <h2 id="set-up-a-dvc-project" style="position:relative;">Set up a DVC project<a href="#set-up-a-dvc-project" aria-label="set up a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>First, add DVC as a requirement to your project with the following installation command:</p> <p><code>$ pip install 'dvc[gs]'</code></p> <p>Then you can initialize DVC in your own project with the following command:</p> <p><a href="https://dvc.org/doc/command-reference/init"><code>$ dvc init</code></a></p> <p>This will add all of the DVC internals needed to start versioning your data and tracking experiments. 
Now we need to set up the remote to connect our project data stored in GCP to the DVC repo.</p> <h3 id="create-a-default-remote" style="position:relative;">Create a default remote<a href="#create-a-default-remote" aria-label="create a default remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Now we can make the GCP storage the default remote for the project with the following command:</p> <p><a href="https://dvc.org/doc/command-reference/remote/add#-d"><code>$ dvc remote add -d bikes gs://updatedbikedata</code></a></p> <p>This creates a default remote called <code>bikes</code> that connects to the <code>updatedbikedata</code> bucket we made earlier, which is where any data for the model will be stored.</p> <h3 id="add-gcp-credentials" style="position:relative;">Add GCP credentials<a href="#add-gcp-credentials" aria-label="add gcp credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In order for DVC to be able to push and pull data from the remote, you need to have valid GCP credentials.</p> <p>If you are using the <a href="https://cloud.google.com/sdk/docs/install-sdk" target="_blank" rel="nofollow noopener noreferrer">GCP 
CLI (google-cloud-sdk)</a> already, you should be able to run <code>gcloud auth application-default login</code>. This method doesn't require a service account.</p> <p>You can also authenticate with the service account we created earlier in a couple of ways with that credentials file we downloaded.</p> <p>You can run the following command with your service account email.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">gcloud</span> auth activate-service-account bicycle-service-account@tonal-history-154018.iam.gserviceaccount.com <span class="token parameter variable">--key-file</span><span class="token operator">=</span><span class="token punctuation">..</span>/tonal-history-154018-e62a79baf90f.json</span></code></pre></div> <p>If you don't have the GCP CLI installed and you want to use the service account, you can set the <code>GOOGLE_APPLICATION_CREDENTIALS</code> environment variable to point to the credentials file, like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span class="token assign-left variable">GOOGLE_APPLICATION_CREDENTIALS</span><span class="token operator">=</span><span class="token string">'../tonal-history-154018-e62a79baf90f.json'</span></span></code></pre></div> <p>Or you can add the credentials file location with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> <span class="token parameter variable">--local</span> bikes credentialpath <span class="token string">'../tonal-history-154018-e62a79baf90f.json'</span></span></code></pre></div> <p>You 
can check out more about authentication <a href="https://cloud.google.com/sdk/docs/authorizing" target="_blank" rel="nofollow noopener noreferrer">here in the GCP docs</a>.</p> <h3 id="push-and-pull-data-with-dvc" style="position:relative;">Push and pull data with DVC<a href="#push-and-pull-data-with-dvc" aria-label="push and pull data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Now you can push data from your local machine to the GCP remote! First, add the data you want DVC to track with the following command:</p> <p><a href="https://dvc.org/doc/command-reference/add"><code>$ dvc add data</code></a></p> <p>This will allow DVC to track the entire <code>data</code> directory so it will note when any changes are made. 
Then you can push that data to your GCP remote with this command:</p> <p><a href="https://dvc.org/doc/command-reference/push"><code>$ dvc push</code></a></p> <p>Here's what that data will look like when it has been successfully uploaded to GCP.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/07a71607f24160ce95b254e6a00fc2cc/39600/data_in_gcp.png" alt="data in GCP" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Then if you move to a different machine or someone else needs to use that data, it can be accessed by cloning or forking the project repo, setting up the remote and running:</p> <p><a href="https://dvc.org/doc/command-reference/pull"><code>$ dvc pull</code></a></p> <p><em>Note:</em> Depending on the authentication method being used, there might be some required extra steps, such as making sure users actually have the permissions to read/write to the bucket.</p> <p>That’s it! Now you can connect any DVC project to a GCP storage bucket. 
If you run into any issues, make sure to check that your credentials are valid, check if your user has MFA enabled, and check that the user has the right level of permissions.</p>https://dvc.org/blog/june-22-community-gemshttps://dvc.org/blog/june-22-community-gemsWed, 29 Jun 2022 00:00:00 GMT<h2 id="is-there-a-shorthand-command-to-commit-changes-to-all-modified-files-in-dvc-without-manually-adding-them-all-individually" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/981498675689828362" target="_blank" rel="nofollow noopener noreferrer">Is there a shorthand command to commit changes to all modified files in DVC without manually adding them all individually?</a><a href="#is-there-a-shorthand-command-to-commit-changes-to-all-modified-files-in-dvc-without-manually-adding-them-all-individually" aria-label="is there a shorthand command to commit changes to all modified files in dvc without manually adding them all individually permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Thanks for the question @Ramnath T!</p> <p>If you already have data tracked by DVC, the <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> command adds all the changes to those files or directories without having to name each target. 
You'll still need to remember to commit any other changes you've made to Git as well.</p> <p>If you don't have data tracked by DVC, run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add &lt;file name or folder name&gt;</code></a> and the data will be added to your local cache and no commit is needed. This is how we make DVC aware of any new data we want versioned.</p> <p>When you run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, a file hash will be calculated, the file content will be moved to the cache, and a <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file will be created to start tracking the added data. If you're working with remotes using the <code>--to-remote</code> option, you can skip the local cache entirely and move the file contents directly to your remote storage.</p> <h2 id="how-can-i-connect-iterative-studio-to-a-remote-repo-on-a-private-network-like-a-gitlab-server" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/981543978644172830" target="_blank" rel="nofollow noopener noreferrer">How can I connect Iterative Studio to a remote repo on a private network, like a GitLab server?</a><a href="#how-can-i-connect-iterative-studio-to-a-remote-repo-on-a-private-network-like-a-gitlab-server" aria-label="how can i connect iterative studio to a remote repo on a private network like a gitlab server permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Good question about <a href="https://studio.datachain.ai/" 
target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> from @LilDataScientist!</p> <p>This is something that our users have asked about quite a bit, so we wrote up a whole guide about <a href="https://dvc.org/doc/studio/user-guide/connect-custom-gitlab-server" target="_blank" rel="nofollow noopener noreferrer">custom GitLab server connections</a>. It's a quick walkthrough of how to set up the permissions you'll need and connect your team to Studio.</p> <p>You can find lots of great guides and explanations about everything Studio in the <a href="https://dvc.org/doc/studio/user-guide" target="_blank" rel="nofollow noopener noreferrer">User Guide</a> section of the docs!</p> <h2 id="how-does-dvc-get-url-interact-with-the-cache-compared-to-dvc-import-url" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/981862313076346920" target="_blank" rel="nofollow noopener noreferrer">How does <code>dvc get-url</code> interact with the cache compared to <code>dvc import-url</code>?</a><a href="#how-does-dvc-get-url-interact-with-the-cache-compared-to-dvc-import-url" aria-label="how does dvc get url interact with the cache compared to dvc import url permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This is an awesome question from @Gema Parreno!</p> <p>When you run <a href="https://dvc.org/doc/command-reference/get-url"><code>dvc get-url</code></a>, it downloads the file/directory to your local file system. 
It does <em>not</em> track the downloaded data with a <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file; it just pulls that data from some source to your file system. If you want to download a file or directory without needing a DVC project, you can use the <a href="https://dvc.org/doc/command-reference/get-url"><code>dvc get-url</code></a> command.</p> <p>On the other hand, when you run <a href="https://dvc.org/doc/command-reference/import-url"><code>dvc import-url</code></a>, the local <code>cache</code> folder inside of <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> will be updated. This is similar to running <a href="https://dvc.org/doc/command-reference/get-url"><code>dvc get-url</code></a> and <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> together, except that <a href="https://dvc.org/doc/command-reference/import-url"><code>dvc import-url</code></a> also saves a link to the original file/directory location so that if it changes, you can download the updated data.</p> <p>There is one more option to bypass the local cache and transfer data directly to your remote storage using <a href="https://dvc.org/doc/command-reference/import-url#--to-remote"><code>dvc import-url &lt;url&gt; --to-remote</code></a>. 
This doesn't download anything to your local cache so it's another way to transfer data between remotes.</p> <h2 id="if-an-image-is-present-in-different-directories-in-different-projects-will-the-shared-cache-store-them-both-as-one-hash-or-will-their-different-paths-mean-the-same-image-appears-twice-in-the-cache" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/984408209387298837" target="_blank" rel="nofollow noopener noreferrer">If an image is present in different directories in different projects, will the shared cache store them both as one hash or will their different paths mean the same image appears twice in the cache?</a><a href="#if-an-image-is-present-in-different-directories-in-different-projects-will-the-shared-cache-store-them-both-as-one-hash-or-will-their-different-paths-mean-the-same-image-appears-twice-in-the-cache" aria-label="if an image is present in different directories in different projects will the shared cache store them both as one hash or will their different paths mean the same image appears twice in the cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Great question about the cache from @paulwrightkcl!</p> <p>DVC will index the whole directory, but there will only be one hash per file. So the same image will only appear once in the cache. 
What <em>will</em> be duplicated in the cache is the <code>.dir</code> hash that DVC uses internally as the directory tree representation.</p> <p>In summary, the image file is only stored in the shared cache once unless it's modified in one of the directories.</p> <h2 id="is-it-possible-to-limit-which-columns-show-for-experiments-in-the-metrics-table" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/985448515402616842" target="_blank" rel="nofollow noopener noreferrer">Is it possible to limit which columns show for experiments in the metrics table?</a><a href="#is-it-possible-to-limit-which-columns-show-for-experiments-in-the-metrics-table" aria-label="is it possible to limit which columns show for experiments in the metrics table permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Nice question from @DylanTF!</p> <p>You can use <a href="https://dvc.org/doc/command-reference/exp/show#--drop"><code>dvc exp show --drop</code></a> (or <code>--keep</code>) to decide what to hide (or show). 
For example, if you have a table like this:</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.seed<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>./clf<span class="token hide">**</span></span></span> <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>./data<span class="token 
hide">**</span></span></span> <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>./data/train.pkl<span class="token hide">**</span></span></span> <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>./src/train.py<span class="token hide">**</span></span></span> <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>src/evaluate.py<span class="token hide">**</span></span></span> </span> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> workspace - - - 20210428 300 75 - a9bb63e aded63c bdc3fe9 b0ef2a1 mlem-serve Jun 16, 2022 0.76681 0.38867 20210428 300 75 - a9bb63e aded63c bdc3fe9 b0ef2a1 </span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>You could clean it up with a command like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--drop</span> <span class="token string">'Created|train.seed|./clf|./data/*|./src/train.py|src/evaluate.py'</span></span></code></pre></div> <p>Then get a table like this:</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ───────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token 
hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> </span> ───────────────────────────────────────────────────────────────── <span class="token rows"> workspace - - 300 75 mlem-serve 0.76681 0.38867 300 75 </span> ─────────────────────────────────────────────────────────────────</code></pre></div> <p>Alternatively, you can run the following command to only show the columns that have changed in the experiment run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--only-changed</span></span></code></pre></div> <p>This will produce a table similar to this one:</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ───────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token 
bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> <span class="token bg-violet"><span class="token hide">dep:</span><span class="token bold"><span class="token hide">**</span>src/train.py<span class="token hide">**</span></span></span> </span> ───────────────────────────────────────────────────────────────────────────── <span class="token rows"> workspace - - - 325 94279e0 mlem-serve Jun 16, 2022 0.76681 0.38867 300 bdc3fe9 </span> ─────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>You can also look at/edit these tables with the <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC VS Code extension</a>! 
If you're interested in more advanced visualizations, you should try out <a href="https://studio.datachain.ai/#features" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a>.</p> <h2 id="is-it-possible-to-create-commit-and-push-updates-to-datasets-using-dvc-with-python-instead-of-the-command-line" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/988895726257991740" target="_blank" rel="nofollow noopener noreferrer">Is it possible to create, commit, and push updates to datasets using DVC with Python instead of the command line?</a><a href="#is-it-possible-to-create-commit-and-push-updates-to-datasets-using-dvc-with-python-instead-of-the-command-line" aria-label="is it possible to create commit and push updates to datasets using dvc with python instead of the command line permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Fantastic question from @wlu07!</p> <p>Yes, we do have an internal <code>Repo</code> class to do DVC operations using Python. 
You can refer to the <a href="https://github.com/iterative/dvc/tree/main/dvc/commands" target="_blank" rel="nofollow noopener noreferrer">GitHub repo for the DVC CLI commands</a> to see how the CLI arguments are translated into the <code>Repo</code> function arguments, and you can see how to use some of the <a href="https://dvc.org/doc/api-reference" target="_blank" rel="nofollow noopener noreferrer"><code>Repo</code> methods in our docs</a>.</p> <p>Here's an example of how you might run DVC commands using Python:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo

repo <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token string">"."</span><span class="token punctuation">)</span>
repo<span class="token punctuation">.</span>add<span class="token punctuation">(</span><span class="token string">"test_dataset.csv"</span><span class="token punctuation">)</span>
repo<span class="token punctuation">.</span>push<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> <p>Keep in mind that <code>dvc.repo.Repo</code> is not an official public API, so there is no guarantee that it will remain stable.</p> <h2 id="how-can-i-write-generated-artifacts-back-to-a-github-repo-after-a-github-workflow-is-finished" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/983379949023006750" target="_blank" rel="nofollow noopener noreferrer">How can I write generated artifacts back to a GitHub repo after a GitHub workflow is finished?</a><a href="#how-can-i-write-generated-artifacts-back-to-a-github-repo-after-a-github-workflow-is-finished" aria-label="how can i write generated artifacts back to a github repo after a github workflow is finished
permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Wonderful CML question from @Fourtin!</p> <p>If you want to add the artifact to your repo just like you would a file, then you should check out the <a href="https://cml.dev/doc/ref/pr" target="_blank" rel="nofollow noopener noreferrer"><code>cml pr &lt;file&gt;</code> command</a>. You can use it to open and merge a pull request into the same branch the workflow was triggered from.</p> <p>For example, if you run a command like:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token cml">cml pr</span> <span class="token parameter variable">--squash</span> train.py</span></code></pre></div> <p>It will run <code>git add train.py</code>, commit the change, create a new branch, open a pull request, and squash and merge it.</p> <h2 id="is-there-a-way-to-programmatically-update-the-content-of-paramspy" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/987004036995764304" target="_blank" rel="nofollow noopener noreferrer">Is there a way to programmatically update the content of <code>params.py</code>?</a><a href="#is-there-a-way-to-programmatically-update-the-content-of-paramspy" aria-label="is there a way to programmatically update the content of paramspy permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Thanks for asking this @petek!</p> <p>Suppose you have a <code>params.py</code> file like this:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">class</span> <span class="token class-name">TrainTestSplit</span><span class="token punctuation">:</span>
    FOLDER <span class="token operator">=</span> <span class="token string">"data/train_test_split"</span>
    SPLIT_METHOD <span class="token operator">=</span> <span class="token string">"proportional"</span></code></pre></div> <p>With DVC, you can update the params and run an experiment in one step with <a href="https://dvc.org/doc/command-reference/exp/run#--set-param"><code>dvc exp run --set-param &lt;param&gt;</code></a>. Here's an example of what that might look like:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> params.py:TrainTestSplit.SPLIT_METHOD<span class="token operator">=</span><span class="token string">"proportional"</span></span></code></pre></div> <p><em>Note:</em> <a href="https://dvc.org/doc/command-reference/params#examples-python-parameters-file" target="_blank" rel="nofollow noopener noreferrer">DVC may not be able to update Python parameters correctly</a>.
Because of this, we recommend using <code>params.yaml</code> files instead.</p> <p>If you need a pure Python solution, you could try something like this:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>utils<span class="token punctuation">.</span>serialize <span class="token keyword">import</span> modify_py

<span class="token keyword">with</span> modify_py<span class="token punctuation">(</span><span class="token string">"params.py"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> d<span class="token punctuation">:</span>
    d<span class="token punctuation">[</span><span class="token string">"key"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"value"</span></code></pre></div> <hr> <p><img src="https://media.giphy.com/media/pdSncNyYgaH0wqaCqp/giphy.gif" alt="Duck Dynasty GIF by DefyTV"></p> <p>Keep an eye out for our next Office Hours Meetup! <a href="https://www.meetup.com/machine-learning-engineer-community-virtual-meetups/" target="_blank" rel="nofollow noopener noreferrer">Join our group</a> to stay up to date with the specifics as we get closer to the event!</p> <p>Check out <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">our docs</a> to get all your DVC and CML questions answered!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to chat with the community!</p>https://dvc.org/blog/DVC-VS-Code-extensionhttps://dvc.org/blog/DVC-VS-Code-extensionTue, 14 Jun 2022 00:00:00 GMT<p>Since its beta release in 2017, DVC has become an essential tool for many data science teams.
Its data versioning capabilities, reproducible pipelines, and experiment tracking features are at the core of our ecosystem of open MLOps tools.</p> <p>Today we are proud to launch a new product that extends how machine learning teams can use DVC: <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">our extension for Visual Studio Code</a>.</p> <p>With this extension, you get a full VS Code-native experimentation platform for your machine learning projects. Control your datasets and models, run experiments, view metrics, create plots, and much more. You can do this all from the comfort of your IDE, without the need for external services or logins. The only thing you need is a <a href="https://dvc.org/doc/start/data-management/data-pipelines#get-started-data-pipelines" target="_blank" rel="nofollow noopener noreferrer">DVC pipeline</a>.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/overview-9b53e8f5328a63e7590c574ffcd46f12.mp4" type="video/mp4"> Your browser does not support the video tag. </video></p> <p> </p><section class="elp-content-holder"> <a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Download the DVC extension</h4> <div class="elp-description">Install the DVC extension from the VS Code marketplace to get started. 
Manage your data, run experiments, compare metrics, and visualize plots, all from the comfort of your IDE.</div> <div class="elp-link">marketplace.visualstudio.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-06-14/vscode-logo-9aa0983c47274b4c145190fb005e1bdd.png" alt="Download the DVC extension"> </div> </a> </section> <p></p> <h1 id="why-a-vs-code-extension" style="position:relative;">Why a VS Code extension?<a href="#why-a-vs-code-extension" aria-label="why a vs code extension permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>We built DVC to expand upon the Git workflow to <a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">make it well-suited for ML experimentation</a>. This approach brought us independence from the infrastructure and provided a natural connection to best practices from software engineering. 
However, a pure CLI tool can only take things so far when it comes to visualizing experiments or displaying large tables.</p> <p><a href="https://insights.stackoverflow.com/survey/2021#section-most-popular-technologies-integrated-development-environment" target="_blank" rel="nofollow noopener noreferrer">VS Code is the IDE of choice for many</a> and was a natural choice for a platform to add a graphical interface to DVC.</p> <p>With this extension, we want to:</p> <ul> <li>Move the ML experimentation workflow into your IDE</li> <li>Provide interactive plots and tables for analyzing ML experiments</li> <li>Make DVC more accessible by providing an alternative to the complexity of the CLI</li> </ul> <p>As data scientists, DVC is our toolbox. This extension turns VS Code into our workshop.</p> <h1 id="features" style="position:relative;">Features<a href="#features" aria-label="features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Our extension introduces the DVC view, your one-stop-shop for everything related to your ML experiments. 
You can run new experiments from here, manage parameters, and compare metrics and plots for different models.</p> <p>The extension also adds panes to the <a href="https://code.visualstudio.com/docs/getstarted/userinterface#_explorer" target="_blank" rel="nofollow noopener noreferrer"><em>Explorer</em></a> and <a href="https://code.visualstudio.com/Docs/editor/versioncontrol" target="_blank" rel="nofollow noopener noreferrer"><em>Source Control</em></a> views for managing all datasets and models in your DVC repository.</p> <p><a href="https://youtu.be/LHi3SWGD9nc" target="_blank" rel="nofollow noopener noreferrer"><em>Check out the feature video on Youtube!</em></a></p> <h2 id="experiment-bookkeeping" style="position:relative;">Experiment bookkeeping<a href="#experiment-bookkeeping" aria-label="experiment bookkeeping permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Quickly run new experiments and compare their resulting metrics in the experiments table. Use the command palette or buttons to reproduce old experiments, run new ones, or add them to the queue for later.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/experiment-bookkeeping-b616e98446515f0510b5fb97df5cd613.mp4" type="video/mp4"> Your browser does not support the video tag. 
</video></p> <h2 id="interactive-plots" style="position:relative;">Interactive plots<a href="#interactive-plots" aria-label="interactive plots permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Select experiments to compare and visualize their performance in interactive plots. You can export these plots to PNG or SVG for use elsewhere.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/compare-experiments-a5306a13eff94b8e30d9d58e12b2a443.mp4" type="video/mp4"> Your browser does not support the video tag. </video></p> <h2 id="live-tracking" style="position:relative;">Live tracking<a href="#live-tracking" aria-label="live tracking permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Get insight into the training process of your models with live tracking of metrics. As soon as your metrics change, your plots will be updated automatically.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/live-metrics-d12f70f91085124fd74a4af4ea8f1f16.mp4" type="video/mp4"> Your browser does not support the video tag. 
</video></p> <h2 id="reproducibility" style="position:relative;">Reproducibility<a href="#reproducibility" aria-label="reproducibility permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Click <em>Apply to workspace</em> to reproduce any past experiment. DVC will restore all artifacts for that experiment, and you can rerun it or use it as a base for a new experiment.</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/apply-to-workspace-923ba22dd0a7a6ef62cc0145ee2fc831.mp4" type="video/mp4"> Your browser does not support the video tag. 
</video></p> <h2 id="data-management" style="position:relative;">Data management<a href="#data-management" aria-label="data management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Use the DVC tracked panel in the <a href="https://code.visualstudio.com/docs/getstarted/userinterface#_explorer" target="_blank" rel="nofollow noopener noreferrer"><em>Explorer</em></a> view to quickly navigate the files in the DVC project(s) in your workspace.</p> <p>The <a href="https://code.visualstudio.com/Docs/editor/versioncontrol" target="_blank" rel="nofollow noopener noreferrer"><em>Source Control</em></a> view now lets you manage datasets and models tracked by DVC without using the terminal. The DVC panel shows you the state of the workspace. From here, you can track artifacts and synchronize versions with your remote repository. Just like you use Git to track changes to your code!</p> <p><video controlslist="nodownload" preload="metadata" autoplay muted loop style="width:100%;"><source src="/2022-06-14/data-management-d768171bc3ae20848014004d6bee36e0.mp4" type="video/mp4"> Your browser does not support the video tag. 
</video></p> <hr> <h1 id="whats-next" style="position:relative;">What's next?<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>From here on out, we plan on making the extension even better with new features such as pipeline (DAG) support, <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">TPI</a> integration for remote execution, autocomplete for <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, and parallel coordinate plots.</p> <p>Of course, we would love to hear what you look forward to most. Please give us feedback on what you would like to see next!</p> <p><img src="https://media.giphy.com/media/cEYFeE4wJ6jdDVBiiIM/giphy.gif" alt="Space Cowboy GIF"></p> <h1 id="thank-you-️" style="position:relative;">Thank you! 
❤️<a href="#thank-you-%EF%B8%8F" aria-label="thank you ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>We would sincerely like to thank everyone who has helped make this project possible:</p> <ul> <li><a href="https://github.com/hediet" target="_blank" rel="nofollow noopener noreferrer">Henning Dieterichs</a>, for helping us get started</li> <li><a href="https://twitter.com/DynamicWebPaige" target="_blank" rel="nofollow noopener noreferrer">Paige Bailey</a>, for her support and warm tweets</li> <li><a href="https://www.linkedin.com/in/siddhanthunnithan/" target="_blank" rel="nofollow noopener noreferrer">Sid Unnithan</a>, for his review and help in getting the word out there</li> <li><a href="https://vscode-dev-community.slack.com/join/shared_invite/zt-zq9w7ddw-VD1NVQ4p2XLT7vh_kO7bJA#/shared-invite/email" target="_blank" rel="nofollow noopener noreferrer">The VS Code developer community</a></li> <li>Everyone who has beta-tested the extension and provided their feedback!</li> </ul> <h1 id="resources" style="position:relative;">Resources<a href="#resources" aria-label="resources permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Want to 
read more about DVC or the extension? Check out the following pages:</p> <ul> <li><a href="https://marketplace.visualstudio.com/items?itemName=Iterative.dvc" target="_blank" rel="nofollow noopener noreferrer">DVC extension on the VS Code marketplace</a></li> <li><a href="https://github.com/iterative/vscode-dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub repository</a></li> <li><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC docs</a></li> <li><a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">Dave Berenbaum's post on DVC's experiment versioning</a></li> <li><a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines" target="_blank" rel="nofollow noopener noreferrer">Alex Kim's guide on setting up an ML pipeline</a></li> <li><a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Iterative community on Discord</a></li> </ul> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/LHi3SWGD9nc?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div>https://dvc.org/blog/azure-remotes-in-dvchttps://dvc.org/blog/azure-remotes-in-dvcMon, 13 Jun 2022 00:00:00 GMT<p>When you’re working on a data science project that has huge datasets, it’s common to store them in cloud storage. 
You’ll also be working with different versions of the same datasets to train a model, so it’s crucial to have a tool that enables you to switch between datasets quickly and easily. That’s why we’re going to do a quick walkthrough of how to set up a remote with Azure Blob Storage and handle data versioning with <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>.</p> <p>We’ll start by creating a new blob storage container in our Azure account, then we’ll show how you can add DVC to your project. We’ll be working with <a href="https://github.com/iterative/stale-model-example" target="_blank" rel="nofollow noopener noreferrer">this repo</a> if you want an example to play with.</p> <admon type="info"> <p>By the time you finish, you should be able to create this setup for any machine learning project using an Azure remote.</p> </admon> <h2 id="set-up-an-azure-blob-storage-container" style="position:relative;">Set up an Azure blob storage container<a href="#set-up-an-azure-blob-storage-container" aria-label="set up an azure blob storage container permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Make sure that you already have a <a href="https://azure.microsoft.com/en-us/features/azure-portal/" target="_blank" rel="nofollow noopener noreferrer">Microsoft Azure account</a>. 
When you log in, you should see a page like this.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/64c512d8a94c04c1857ad929b0111f22/39600/initial_azure.png" alt="initial Azure page" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Search for <code>storage accounts</code> in the search bar and click <code>Storage accounts</code> under <code>Services</code>. Make sure you don't click the "classic" option.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d36abba60a0f4782c01e831f9d414c16/39600/storage_account_search.png" alt="search for storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>This will bring you to the <code>Storage accounts</code> page, where you'll need to click the <code>Create storage account</code> button.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1ded7676bdc2404e371fc87c428ef229/39600/storage_account_page.png" alt="storage accounts page" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Now you need to enter a <code>Resource group</code> and a name for the account. You can create a new resource group right here, like we do, and call it <code>BicycleProject</code>. We'll name this storage account <code>bicycleproject</code>.
Then you can leave all the default settings in place and click <code>Review + create</code>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/681ef7efd9fb3b770ca862910073965d/39600/storage_account_details.png" alt="storage account details" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Azure will validate the account. Once validation passes, click <code>Create</code> to generate the storage account.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/56870dafd3e809dfdbdc5da58b98b4a6/39600/created_storage_account.png" alt="created storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>You'll be redirected to a new page, where you should click the <code>Go to resource</code> button. Now you should see all of the details for your storage account. In the left sidebar, go to <code>Data storage</code> &gt; <code>Containers</code>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/336f02f3dc5208eb4fd6aa657bb83dba/39600/bicycle_project_account.png" alt="bicycle project account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Then click the <code>+ Container</code> button at the top of the new page and you'll see a right sidebar open. In the name field, type <code>bikedata</code> and then click <code>Create</code>.
Now we have everything set up for the blob storage to work.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/469f2358acd6ad6849cef0114f3a80a2/39600/bikedata_container.png" alt="new container for bike data" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="set-the-right-roles-for-your-azure-account" style="position:relative;">Set the right roles for your Azure account<a href="#set-the-right-roles-for-your-azure-account" aria-label="set the right roles for your azure account permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You'll need the right roles on your storage account and your container in order to connect this remote storage to your machine learning project.</p> <p>On the page for your <code>bicycleproject</code> storage account, go to <code>Access Control (IAM)</code> in the left sidebar.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7b079b25579752299c2692d3559a49bb/39600/storage_account_iam.png" alt="update roles for storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>On this page, click <code>Add role assignment</code> to open the page that lists all of the roles.</p> <p><span
class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f251e28b6fd0fd87465cc467e7febacd/39600/storage_account_role.png" alt="update roles for storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Select the <code>Storage Blob Data Contributor</code> role and click <code>Next</code></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a6c1bb759192b9b66b5461dffba75ad1/39600/storage_account_member.png" alt="update roles for storage account" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Then you can click <code>+ Select members</code> to add this role to your user.</p> <p>You'll also need to go through this exact flow for your <code>bikedata</code> container, so make sure you do this immediately after your storage account is updated.</p> <p>Since our Azure storage account and container have the correct roles now, let's set up the project!</p> <h2 id="set-up-a-dvc-project" style="position:relative;">Set up a DVC project<a href="#set-up-a-dvc-project" aria-label="set up a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>First, add DVC as a requirement to your project with the following installation command:</p> <div class="gatsby-highlight" 
data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token string">'dvc[azure]'</span></span></code></pre></div> <p>Then you can initialize DVC in your own project with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span></span></code></pre></div> <p>This will add all of the DVC internals needed to start versioning your data and tracking experiments. Now we need to set up the remote to connect our project data stored in Azure to the DVC repo.</p> <h3 id="create-a-default-remote" style="position:relative;">Create a default remote<a href="#create-a-default-remote" aria-label="create a default remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Now we can add a default remote to the project with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> bikes azure://bikedata</span></code></pre></div> <p>This creates a default remote called <code>bikes</code> that connects to the <code>bikedata</code> container we made earlier, which is where the training data for the model will be 
stored.</p> <h3 id="add-azure-credentials" style="position:relative;">Add Azure credentials<a href="#add-azure-credentials" aria-label="add azure credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In order for DVC to be able to push and pull data from the remote, you need to have valid Azure credentials.</p> <p>By default, DVC authenticates using your <a href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli" target="_blank" rel="nofollow noopener noreferrer">Azure CLI</a> configuration.</p> <p>Run the following command to authenticate with Azure.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">az</span> login </span>A web browser has been opened at https://login.microsoftonline.com/organizations/oauth2/v2.0/authorize. Please continue the login in the web browser. If no web browser is available or if the web browser fails to open, use device code flow with `az login --use-device-code`. 
[ { "cloudName": "AzureCloud", "homeTenantId": "some-id", "id": "some-id", "isDefault": true, "managedByTenants": [], "name": "Azure subscription 1", "state": "Enabled", "tenantId": "some-id", "user": { "name": "[email protected]", "type": "user" } } ]</code></pre></div> <p>This should open a window that looks like this where you can enter your login credentials.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e75c1384fe653dd3dd95bae6893dfc5d/39600/azure_auth_page.png" alt="Azure CLI authentication page" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>You can check out more details on this command <a href="https://docs.microsoft.com/en-us/cli/azure/authenticate-azure-cli" target="_blank" rel="nofollow noopener noreferrer">here in the Azure docs</a>. If you want to use a different authentication method with DVC, check out <a href="https://dvc.org/doc/command-reference/remote/modify#microsoft-azure-blob-storage" target="_blank" rel="nofollow noopener noreferrer">our docs here</a>.</p> <p>You will also need to manually define the storage account name with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> bikes account_name <span class="token string">'bicycleproject'</span></span></code></pre></div> <h3 id="push-and-pull-data-with-dvc" style="position:relative;">Push and pull data with DVC<a href="#push-and-pull-data-with-dvc" aria-label="push and pull data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 
1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Now you can push data from your local machine to the Azure remote! First, add the data you want DVC to track with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> data</span></code></pre></div> <p>This will allow DVC to track the entire <code>data</code> directory so it will note when any changes are made. Then you can push that data to your Azure remote with this command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div> <p>Here's what the data might look like in your Azure container.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7cbb06d18650df41a08fe6a8c35acb3d/39600/data_in_azure.png" alt="data in Azure container" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Then if you move to a different machine or someone else needs to use that data, it can be accessed by cloning or forking the project repo and running:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span></span></code></pre></div> <p>This will get any data from your remote and download it to your local machine.</p> <admon 
type="info"> <p>Authentication has to be set up locally on any machine you need to pull or push data from. That means running the <code>az login</code> command on each of those machines. You don't need to go through the DVC setup again.</p> </admon> <hr> <p>That’s it! Now you can connect any DVC project to an Azure blob storage container. If you run into any issues, make sure to check that your credentials are valid, check whether your user has MFA enabled, and check that the user has the right level of permissions.</p>https://dvc.org/blog/MLEM-releasehttps://dvc.org/blog/MLEM-releaseWed, 01 Jun 2022 00:00:00 GMT<p>With MLEM, ML teams get a single tool to <strong>run your models anywhere</strong>, one that strives to cover all of your model productionization scenarios.</p> <p>MLEM enables this via <strong>model metadata codification</strong>: saving all the information that is required to use a model later. Besides packaging a model for deployment, this metadata can be used for many things, including search and documentation. To make it even more convenient, MLEM stores it in human-readable YAML files.</p> <p>Finally, keeping that metainformation in Git allows you to create a <strong>Git-native model registry</strong> and handle model lifecycle management in Git, with all the benefits of CI/CD. This brings your ML team one step closer to GitOps.</p> <p>We built MLEM to address issues that MLOps teams have around managing model information as they move models from training and development to production and, ultimately, retirement. 
The Git-based model (<a href="https://iterative.ai/why-iterative/" target="_blank" rel="nofollow noopener noreferrer">one of our core philosophies</a>) aligns model operations and deployment with software development teams – information and automation are all based on familiar DevOps tools – so that deploying any model into production is that much faster.</p> <h1 id="model-metadata-codification" style="position:relative;">Model metadata codification<a href="#model-metadata-codification" aria-label="model metadata codification permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Capturing model-specific information requires an understanding of the programming language and ML frameworks the models are created with. That's why MLEM is a Python-specific tool. 
To provide a developer-first experience, MLEM exposes a carefully designed CLI to help you manage the DevOps parts of the workflow from the command line, and a Python API to handle model productionization programmatically.</p> <p>It's easy to start using MLEM, since you can integrate it into your existing training workflows by adding a couple of lines:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> mlem

mlem<span class="token punctuation">.</span>api<span class="token punctuation">.</span>save<span class="token punctuation">(</span>
    my_model<span class="token punctuation">,</span>
    <span class="token string">"mlem-model"</span><span class="token punctuation">,</span>
    sample_data<span class="token operator">=</span>train
<span class="token punctuation">)</span></code></pre></div> <p>That produces two files: the model binary and the model metadata, which is a <code>.mlem</code> file:</p> <div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ <span class="token function">ls</span> models
mlem-model mlem-model.mlem</code></pre></div> <p>MLEM automatically detects everything you need to run the model: the ML framework, model dependencies (i.e., Python requirements), methods, and the input/output data schema (note that we didn't specify any of those in the <code>save</code> call above!).</p> <p>This enables easy codification of arbitrarily complex models, such as a Python function that averages the predictions of models from a couple of frameworks, or a custom Python class that uses different libraries to generate the features and make a prediction. 
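</p> <p>To make the first of those cases concrete, here is a minimal sketch of a function that averages the predictions of two models. The names <code>model_a</code>, <code>model_b</code>, and <code>average_predict</code> are hypothetical stand-ins written in plain Python, not part of MLEM; in a real project they would wrap models built with different frameworks.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">def model_a(row):
    # Hypothetical stand-in for a model from one framework.
    return 0.5 * row[0] + 0.25 * row[1]


def model_b(row):
    # Hypothetical stand-in for a model from another framework.
    return 0.25 * row[0] + 0.75 * row[1]


def average_predict(rows):
    # Average the two models' predictions, one value per input row.
    return [(model_a(row) + model_b(row)) / 2 for row in rows]


sample = [[1.0, 2.0], [4.0, 0.0]]
print(average_predict(sample))  # [1.375, 1.5]</code></pre></div> <p>A callable like <code>average_predict</code> is the kind of object you would then pass to <code>mlem.api.save</code> in place of <code>my_model</code>, along with sample data. How completely MLEM captures the dependencies of a given custom function is worth checking against the MLEM documentation.</p> <p>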
MLEM saves this information in a simple human-readable YAML file:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># mlem-model.mlem</span>
<span class="token key atrule">artifacts</span><span class="token punctuation">:</span>
  <span class="token key atrule">data</span><span class="token punctuation">:</span>
    <span class="token key atrule">hash</span><span class="token punctuation">:</span> b7f7e869f2b9270c516b546f09f49cf7
    <span class="token key atrule">size</span><span class="token punctuation">:</span> <span class="token number">166864</span>
    <span class="token key atrule">uri</span><span class="token punctuation">:</span> mlem<span class="token punctuation">-</span>model
<span class="token key atrule">description</span><span class="token punctuation">:</span> Random Forest Classifier
<span class="token key atrule">labels</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> random<span class="token punctuation">-</span>forest
  <span class="token punctuation">-</span> classifier
<span class="token key atrule">model_type</span><span class="token punctuation">:</span>
  <span class="token key atrule">methods</span><span class="token punctuation">:</span>
    <span class="token key atrule">predict_proba</span><span class="token punctuation">:</span>
      <span class="token key atrule">args</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> data
          <span class="token key atrule">type_</span><span class="token punctuation">:</span>
            <span class="token key atrule">columns</span><span class="token punctuation">:</span>
              <span class="token punctuation">-</span> sepal length (cm)
              <span class="token punctuation">-</span> sepal width (cm)
              <span class="token punctuation">-</span> petal length (cm)
              <span class="token punctuation">-</span> petal width (cm)
            <span class="token key atrule">dtypes</span><span class="token punctuation">:</span>
              <span class="token punctuation">-</span> float64
              <span class="token punctuation">-</span> float64
              <span class="token punctuation">-</span> float64
              <span class="token punctuation">-</span> float64
            <span class="token key atrule">index_cols</span><span class="token punctuation">:</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
            <span class="token key atrule">type</span><span class="token punctuation">:</span> dataframe
      <span class="token key atrule">name</span><span class="token punctuation">:</span> predict_proba
      <span class="token key atrule">returns</span><span class="token punctuation">:</span>
        <span class="token key atrule">dtype</span><span class="token punctuation">:</span> float64
        <span class="token key atrule">shape</span><span class="token punctuation">:</span>
          <span class="token punctuation">-</span> <span class="token null important">null</span>
          <span class="token punctuation">-</span> <span class="token number">3</span>
        <span class="token key atrule">type</span><span class="token punctuation">:</span> ndarray
  <span class="token key atrule">type</span><span class="token punctuation">:</span> sklearn
<span class="token key atrule">object_type</span><span class="token punctuation">:</span> model
<span class="token key atrule">requirements</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">module</span><span class="token punctuation">:</span> sklearn
    <span class="token key atrule">version</span><span class="token punctuation">:</span> 1.0.2
  <span class="token punctuation">-</span> <span class="token key atrule">module</span><span class="token punctuation">:</span> pandas
    <span class="token key atrule">version</span><span class="token punctuation">:</span> 1.4.1
  <span class="token punctuation">-</span> <span class="token key atrule">module</span><span class="token punctuation">:</span> numpy
    <span class="token key atrule">version</span><span class="token punctuation">:</span> 1.22.3</code></pre></div> <p>To make ML model development Git-native, MLEM can work with DVC to manage versions of a model stored remotely in the cloud. Committing both the model metainformation (<code>mlem-model.mlem</code>) and a pointer to the model binary (<code>mlem-model.dvc</code> or <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> if you train it in a DVC pipeline) to Git lets you adopt GitFlow and other software engineering best practices like GitOps.</p> <h1 id="running-your-models-anywhere" style="position:relative;">Running your models anywhere<a href="#running-your-models-anywhere" aria-label="running your models anywhere permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>The main goal of MLEM is to provide you with a single tool that supports any kind of model productionization scenario. 
For MLEM, there are three main groups of those scenarios:</p> <ul> <li><strong>Use</strong> a model directly with MLEM.</li> <li><strong>Export</strong> a model to a format that can be used by other tools.</li> <li><strong>Deploy</strong> a model to a production environment or cloud provider.</li> </ul> <p>The first one allows you to import your model into a Python runtime, run predict against some dataset directly in the command line, or serve the model with MLEM from your CLI.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">$ python <span class="token operator">>></span><span class="token operator">></span> <span class="token keyword">import</span> mlem <span class="token operator">>></span><span class="token operator">></span> model <span class="token operator">=</span> mlem<span class="token punctuation">.</span>api<span class="token punctuation">.</span>load<span class="token punctuation">(</span><span class="token string">"mlem-model"</span><span class="token punctuation">)</span> <span class="token operator">>></span><span class="token operator">></span> model<span class="token punctuation">.</span>predict<span class="token punctuation">(</span>test<span class="token punctuation">)</span> <span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.3</span><span class="token punctuation">,</span> <span class="token number">0.3</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">0.5</span><span class="token punctuation">,</span> <span class="token number">0.3</span><span class="token punctuation">]</span><span class="token punctuation">]</span></code></pre></div> <div class="gatsby-highlight" 
data-language="shell"><pre class="language-shell"><code class="language-shell">$ mlem apply mlem-model test.csv <span class="token punctuation">[</span><span class="token punctuation">[</span><span class="token number">0.4</span>, <span class="token number">0.3</span>, <span class="token number">0.3</span><span class="token punctuation">]</span>, <span class="token punctuation">[</span><span class="token number">0.2</span>, <span class="token number">0.5</span>, <span class="token number">0.3</span><span class="token punctuation">]</span><span class="token punctuation">]</span></code></pre></div> <div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ mlem serve ml-model fastapi ⏳️ Loading model from ml-model.mlem Starting fastapi server<span class="token punctuation">..</span>. 💅 Adding route <span class="token keyword">for</span> /predict 💅 Adding route <span class="token keyword">for</span> /predict_proba Checkout openapi docs at <span class="token operator"><</span>http://0.0.0.0:8080/docs<span class="token operator">></span> INFO: Started server process <span class="token punctuation">[</span><span class="token number">5750</span><span class="token punctuation">]</span> INFO: Waiting <span class="token keyword">for</span> application startup. INFO: Application startup complete. 
INFO: Uvicorn running on http://0.0.0.0:8080 <span class="token punctuation">(</span>Press CTRL+C to quit<span class="token punctuation">)</span></code></pre></div> <p>The second one allows you to export your models as a Python package, build a Docker Image, or export it as some special format (like <code>.onnx</code> which is coming soon).</p> <div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ mlem build mlem-model pip <span class="token parameter variable">-c</span> <span class="token assign-left variable">package_name</span><span class="token operator">=</span>mlem-translate <span class="token parameter variable">-c</span> <span class="token assign-left variable">target</span><span class="token operator">=</span>build/ ⏳️ Loading model from ml-model.mlem 💼 Written <span class="token variable"><span class="token variable">`</span>ml-package<span class="token variable">`</span></span> package data to <span class="token variable"><span class="token variable">`</span>build/<span class="token variable">`</span></span> $ tree build/ build ├── MANIFEST.in ├── ml-package │ ├── __init__.py │ ├── model │ └── model.mlem ├── requirements.txt └── setup.py</code></pre></div> <p>The last one allows you to deploy models to deployment providers, such as Heroku (with AWS Sagemaker and Kubernetes coming soon).</p> <div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ mlem deployment run myservice <span class="token parameter variable">-m</span> mlem-model <span class="token parameter variable">-t</span> staging <span class="token parameter variable">-c</span> <span class="token assign-left variable">app_name</span><span class="token operator">=</span>mlem-quick-start ⏳️ Loading deployment from my-service.mlem 🔗 Loading <span class="token function">link</span> to staging.mlem 🔗 Loading <span class="token function">link</span> to mlem-model.mlem 💾 Updating deployment 
at my-service.mlem 🛠 Creating <span class="token function">docker</span> image <span class="token keyword">for</span> heroku 🛠 Building MLEM wheel file<span class="token punctuation">..</span>. 💼 Adding model files<span class="token punctuation">..</span>. 🛠 Generating dockerfile<span class="token punctuation">..</span>. 💼 Adding sources<span class="token punctuation">..</span>. 💼 Generating requirements file<span class="token punctuation">..</span>. 🛠 Building <span class="token function">docker</span> image registry.heroku.com/mlem-quick-start/web<span class="token punctuation">..</span>. ✅ Built <span class="token function">docker</span> image registry.heroku.com/mlem-quick-start/web 🔼 Pushing image registry.heroku.com/mlem-quick-start/web to registry.heroku.com ✅ Pushed image registry.heroku.com/mlem-quick-start/web to registry.heroku.com 💾 Updating deployment at my-service.mlem 🛠 Releasing app mlem-quick-start formation 💾 Updating deployment at my-service.mlem ✅ Service mlem-quick-start is up. 
You can check it out at https://mlem-quick-start.herokuapp.com/</code></pre></div> <p>Since MLEM is both a CLI-first and an API-first tool, you can productionize your models just as easily with the Python API:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">$ python
<span class="token operator">>></span><span class="token operator">></span> <span class="token keyword">from</span> mlem<span class="token punctuation">.</span>api <span class="token keyword">import</span> serve<span class="token punctuation">,</span> build<span class="token punctuation">,</span> deploy</code></pre></div> <h1 id="git-native-model-registry" style="position:relative;">Git-native model registry<a href="#git-native-model-registry" aria-label="git native model registry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>MLEM is a core building block for a Git-based ML model registry, together with other Iterative tools, like GTO and DVC.</p> <p>ML model registries give your team key capabilities:</p> <ul> <li>Collect and organize model <a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files" target="_blank" rel="nofollow noopener noreferrer">versions</a> from different sources effectively, preserving their data provenance and lineage information.</li> <li>Share metadata including <a href="https://dvc.org/doc/start/metrics-parameters-plots" target="_blank" rel="nofollow noopener noreferrer">metrics and plots</a> to help use and evaluate models.</li> <li>Access all your ML artifacts through a standard interface, from 
early-stage <a href="https://dvc.org/doc/user-guide/experiment-management" target="_blank" rel="nofollow noopener noreferrer">experiments</a> to production-ready models.</li> <li>Deploy specific models on different environments (dev, shadow, prod, etc.) without touching the applications that consume them.</li> <li>For security, control who can manage models, and audit their usage trails.</li> </ul> <p>Many of these benefits are built into DVC: Your <a href="https://dvc.org/doc/start/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">modeling process</a> and <a href="https://dvc.org/doc/start/metrics-parameters-plots" target="_blank" rel="nofollow noopener noreferrer">performance data</a> become <strong>codified</strong> in Git-based <abbr>DVC repositories</abbr>, making it possible to reproduce and manage models with standard Git workflows (along with code). Large model files are stored separately and efficiently, and can be pushed to <a href="https://dvc.org/doc/command-reference/remote" target="_blank" rel="nofollow noopener noreferrer">remote storage</a> — a scalable access point for <a href="https://dvc.org/doc/start/data-and-model-access" target="_blank" rel="nofollow noopener noreferrer">sharing</a>.</p> <p>To make a Git-native registry, one option is to use <a href="https://github.com/iterative/gto" target="_blank" rel="nofollow noopener noreferrer">GTO</a> (Git Tag Ops). It tags ML model releases and promotions, and links them to artifacts in the repo using versioned annotations. 
This creates abstractions for your models, which lets you <strong>manage their lifecycle</strong> freely and directly from Git.</p> <div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$ gto show ╒══════════════════════╤══════════╤════════╤═════════╕ │ name │ latest │ <span class="token comment">#stage │ #prod │</span> ╞══════════════════════╪══════════╪════════╪═════════╡ │ pet-face-recognition │ v3.1.0 │ - │ v3.0.0 │ │ mlem-blep-classifier │ v0.4.1 │ v0.4.1 │ - │ │ dog-bark-translator │ v0.0.1 │ - │ v0.0.1 │ ╘══════════════════════╧══════════╧════════╧═════════╛ $ mlem apply dog-bark-translator ./short-dog-phrase.wav 🐶🚀🎉</code></pre></div> <p>For more information, visit our <a href="https://iterative.ai/model-registry" target="_blank" rel="nofollow noopener noreferrer">model registry page</a>.</p> <h1 id="what-next" style="position:relative;">What next?<a href="#what-next" aria-label="what next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>⭐ <strong>Star <a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">MLEM on GitHub</a></strong> and let us know what you think!</p> <p><img src="https://dvc.org/2022-06-01/mlem-repo-umbrella-dog-6c0b38915bbeb06b0edc21acd71bb3b6.gif" alt="Umbrella dog" title="Machine Learning should be mlemming!"></p> <p>Machine Learning should be mlemming! 
🚀</p> <p>Resources:</p> <ul> <li><a href="https://mlem.ai/doc" target="_blank" rel="nofollow noopener noreferrer">Documentation</a></li> <li><a href="https://mlem.ai" target="_blank" rel="nofollow noopener noreferrer">MLEM website</a></li> <li><a href="https://github.com/iterative/mlem" target="_blank" rel="nofollow noopener noreferrer">MLEM on GitHub</a></li> <li><a href="https://iterative.ai/model-registry/" target="_blank" rel="nofollow noopener noreferrer">Building an ML model registry</a></li> </ul> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/aws-remotes-in-dvchttps://dvc.org/blog/aws-remotes-in-dvcTue, 31 May 2022 00:00:00 GMT<p>When you’re working on a data science project that has huge datasets, it’s common to store them in cloud storage. You’ll also be working with different versions of the same datasets to train a model, so it’s crucial to have a tool that enables you to switch between datasets quickly and easily. 
That’s why we’re going to do a quick walkthrough of how to set up a remote in an AWS S3 bucket and handle data versioning with <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>.</p> <p>We’ll start by creating a new S3 bucket in our AWS account, then we’ll show how you can add DVC to your project. We’ll be working with <a href="https://github.com/iterative/stale-model-example" target="_blank" rel="nofollow noopener noreferrer">this repo</a> if you want an example to play with.</p> <admon type="info"> <p>By the time you finish, you should be able to create this setup for any machine learning project using an AWS remote.</p> </admon> <h2 id="set-up-an-aws-s3-bucket" style="position:relative;">Set up an AWS S3 bucket<a href="#set-up-an-aws-s3-bucket" aria-label="set up an aws s3 bucket permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Make sure that you already have an <a href="https://aws.amazon.com/" target="_blank" rel="nofollow noopener noreferrer">AWS account</a> and log in. 
Search for <code>S3</code> and it should be the first service that appears.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0d7844098f66ea6a742edb060adc4920/39600/finding_s3.png" alt="S3 service in AWS" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Once you’re on the S3 page, click the <code>Create Bucket</code> button and it will take you to a page that looks like this. The bucket in this example is called <code>updatedbikedata</code> because that is the data our demo repo works with. You can leave the default settings in place or you can update them to fit the functionality you need.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fc60057d7f3d97a8faa8abbaa9ddda79/39600/create_bucket.png" alt="create an S3 bucket in AWS" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Once you’ve created the bucket, you should be redirected to the S3 dashboard and see the success message and your new bucket.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b07616e56bf4dbe8f93680ebb5de0d22/39600/created_bucket.png" alt="newly created S3 bucket in AWS" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="get-your-credentials" style="position:relative;">Get your credentials<a href="#get-your-credentials" aria-label="get your credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 
9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Now that the S3 bucket is ready, we need the <code>access_key_id</code> and <code>secret_access_key</code> from AWS in order to connect to our project. You can create these keys in your Identity and Access Management settings. Go to your security credentials and select the <code>Access keys</code> section. Then click the <code>Create New Access Key</code> button. This will generate a new set of keys for you so make sure you download this file to get your secret access key.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/60472bc0f92d5ea991eb531263245603/39600/make_credentials.png" alt="make AWS access credentials" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Once you’ve downloaded the credentials, you should see the access key ID in the table. Note that you won’t be able to access your secret key again at this point. You would need to make a new set of credentials if you don’t have it.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2c89ef39ec30bdaf80ddea219fb9e433/39600/credentials.png" alt="successfully created AWS access credentials" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>That’s it for setting up your bucket and getting the credentials you need! 
Now let’s add DVC to our demo repo and set up the remote.</p> <h2 id="set-up-a-dvc-project" style="position:relative;">Set up a DVC project<a href="#set-up-a-dvc-project" aria-label="set up a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>First, add DVC as a requirement to your project with the following installation command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token string">'dvc[s3]'</span></span></code></pre></div> <p>Then you can initialize DVC in your own project with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span></span></code></pre></div> <p>This will add all of the DVC internals needed to start versioning your data and tracking experiments. 
Now we need to set up the remote to connect our project data stored in AWS to the DVC repo.</p> <h3 id="create-a-default-remote" style="position:relative;">Create a default remote<a href="#create-a-default-remote" aria-label="create a default remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Now we can add a default remote to the project with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> bikes s3://updatedbikedata</span></code></pre></div> <p>This creates a default remote called <code>bikes</code> that connects to the <code>updatedbikedata</code> bucket we made earlier, which is where the training data for the model will be stored.</p> <h3 id="add-aws-credentials" style="position:relative;">Add AWS credentials<a href="#add-aws-credentials" aria-label="add aws credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In order for DVC to be able to push and pull data from the remote, you need to have valid 
AWS credentials.</p> <p>By default, DVC authenticates using your AWS CLI configuration, if it has been set. You can do that with the <code>aws configure</code> command like in this example:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">aws</span> configure </span>AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: Default output format [None]:</code></pre></div> <p>You can check out more details on this command <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html" target="_blank" rel="nofollow noopener noreferrer">here in the AWS docs</a>.</p> <p>If you want to <a href="https://dvc.org/doc/command-reference/remote/modify#amazon-s3" target="_blank" rel="nofollow noopener noreferrer">use a different authentication method</a> or if you run into issues with the credentials, you can manually add them with the following commands:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> <span class="token parameter variable">--local</span> bikes access_key_id <span class="token string">'mykey'</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> <span class="token parameter variable">--local</span> bikes secret_access_key <span class="token string">'mysecret'</span></span></code></pre></div> <h3 id="push-and-pull-data-with-dvc" style="position:relative;">Push and pull data with DVC<a href="#push-and-pull-data-with-dvc" aria-label="push and pull data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 
0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Now you can push data from your local machine to the AWS remote! First, add the data you want DVC to track with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> data</span></code></pre></div> <p>This will allow DVC to track the entire <code>data</code> directory so it will note when any changes are made. Then you can push that data to your AWS remote with this command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div> <p>Here's what the data might look like in your AWS bucket.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8995e2ee92b7f960ca732dad0b0d802e/39600/aws_bucket.png" alt="data in AWS bucket" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Then if you move to a different machine or someone else needs to use that data, it can be accessed by cloning or forking the project repo and running:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span></span></code></pre></div> <p>This will get any data from your 
remote and download it to your local machine.</p> <admon type="info"> <p>Authentication has to be set up locally on any machine you need to pull or push data from.</p> </admon> <hr> <p>That’s it! Now you can connect any DVC project to an AWS S3 bucket. If you run into any issues, make sure to check that your credentials are valid, check if your user has MFA enabled, and check that the user has the right level of permissions.</p>https://dvc.org/blog/may-22-community-gemshttps://dvc.org/blog/may-22-community-gemsThu, 26 May 2022 00:00:00 GMT<h3 id="is-it-possible-to-export-a-plot-generated-using-dvc-plots-diff-head-main-to-vega-lite-for-use-in-cml" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/965911829538832435" target="_blank" rel="nofollow noopener noreferrer">Is it possible to export a plot generated using <code>dvc plots diff HEAD main</code> to vega-lite for use in CML?</a><a href="#is-it-possible-to-export-a-plot-generated-using-dvc-plots-diff-head-main-to-vega-lite-for-use-in-cml" aria-label="is it possible to export a plot generated using dvc plots diff head main to vega lite for use in cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the awesome question @dominic!</p> <p>You can use the <a href="https://dvc.org/doc/command-reference/plots/diff#--show-vega"><code>dvc plots diff --show-vega</code></a> command to export the plot to vega-lite on a single graph. 
You'll need to run the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> HEAD main <span class="token parameter variable">--targets</span> prediction.json <span class="token parameter variable">--show-vega</span> <span class="token operator">></span> vega.json</span></code></pre></div> <p>You can also include this plot in a comment with CML so that it appears on your pull requests in GitHub.</p> <h3 id="what-is-the-difference-between-dvc-pull-and-dvc-checkout" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/966739538888241192" target="_blank" rel="nofollow noopener noreferrer">What is the difference between <code>dvc pull</code> and <code>dvc checkout</code>?</a><a href="#what-is-the-difference-between-dvc-pull-and-dvc-checkout" aria-label="what is the difference between dvc pull and dvc checkout permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Great question @Derek!</p> <p>Here are some explanations around how <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> and <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> work. 
They're comparable to <code>git pull</code> and <code>git checkout</code>.</p> <ul> <li><a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> fetches data from your remote cache to your local cache and syncs it to your workspace</li> <li><a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> syncs data from your local cache to your workspace</li> </ul> <h3 id="is-there-a-way-to-add-all-of-the-outs-of-a-foreach-job-to-the-deps-of-a-downstream-stage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/967709548393672734" target="_blank" rel="nofollow noopener noreferrer">Is there a way to add all of the <code>outs</code> of a <code>foreach</code> job to the <code>deps</code> of a downstream stage?</a><a href="#is-there-a-way-to-add-all-of-the-outs-of-a-foreach-job-to-the-deps-of-a-downstream-stage" aria-label="is there a way to add all of the outs of a foreach job to the deps of a downstream stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Very interesting question from @mathematiguy!</p> <p>One way to do this is to have all <code>foreach</code> stages write out to different paths within the same directory and then track the entire directory as a dependency of your downstream stage.</p> <p>Here's an example of how that might look in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code 
class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">cleanups</span><span class="token punctuation">:</span> <span class="token key atrule">foreach</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> raw1 <span class="token punctuation">-</span> labels1 <span class="token punctuation">-</span> raw2 <span class="token key atrule">do</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> echo "$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>" <span class="token punctuation">></span> "data/$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>" <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data/$<span class="token punctuation">{</span>item<span class="token punctuation">}</span> <span class="token key atrule">reduce</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> echo file <span class="token punctuation">></span> file <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> file</code></pre></div> <h3 id="is-there-a-way-to-version-and-move-data-from-one-cloud-storage-to-another-with-dvc-remotes" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/968778284114538496" target="_blank" rel="nofollow noopener noreferrer">Is there a way to version and move data from one cloud storage to another with DVC remotes?</a><a href="#is-there-a-way-to-version-and-move-data-from-one-cloud-storage-to-another-with-dvc-remotes" 
aria-label="is there a way to version and move data from one cloud storage to another with dvc remotes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Wonderful question from @Hisham!</p> <p>There are a couple of ways you can do this. One approach is to use <a href="https://dvc.org/doc/command-reference/add#--to-remote"><code>dvc add --to-remote</code></a>.</p> <p>The other approach is to use the <a href="https://dvc.org/doc/command-reference/import-url#example-transfer-to-remote-storage" target="_blank" rel="nofollow noopener noreferrer"><code>import-url --to-remote</code></a> functionality. The main difference between these approaches is that <a href="https://dvc.org/doc/command-reference/import-url"><code>dvc import-url</code></a> has the added benefit of keeping a connection to the data source so it can be updated later with <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a>.</p> <p>You can see an example of how to do this in the docs. 
Just make sure that you have your remotes set up!</p> <h3 id="if-im-using-feast-feature-store-is-it-possible-to-version-datasets-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/968899175561449532" target="_blank" rel="nofollow noopener noreferrer">If I'm using Feast feature store, is it possible to version datasets with DVC?</a><a href="#if-im-using-feast-feature-store-is-it-possible-to-version-datasets-with-dvc" aria-label="if im using feast feature store is it possible to version datasets with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a good integration question from @Bernardo Galvao!</p> <p>If you want to fetch historical features from the offline store to generate training data, a typical pattern would be to write the script to do so and set up a DVC pipeline stage to track that script and version the output file. 
This is similar to how a lot of people use DVC alongside SQL databases.</p> <h3 id="how-can-i-run-a-dvc-pipeline-in-a-docker-container" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/969640280263389184" target="_blank" rel="nofollow noopener noreferrer">How can I run a DVC pipeline in a Docker container?</a><a href="#how-can-i-run-a-dvc-pipeline-in-a-docker-container" aria-label="how can i run a dvc pipeline in a docker container permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Nice question from @Anudeep!</p> <p>Here's an example of a Dockerfile with a simple DVC setup.</p> <div class="gatsby-highlight" data-language="docker"><pre class="language-docker"><code class="language-docker"><span class="token instruction"><span class="token keyword">FROM</span> ubuntu:latest</span> <span class="token instruction"><span class="token keyword">RUN</span> apt-get update && apt-get install -y python-is-python3 python3-pip</span> <span class="token instruction"><span class="token keyword">WORKDIR</span> /dvc_project</span> <span class="token instruction"><span class="token keyword">COPY</span> . 
.</span> <span class="token instruction"><span class="token keyword">RUN</span> pip install -r requirements.txt</span> <span class="token comment"># assuming your requirements, including dvc, are here</span> <span class="token instruction"><span class="token keyword">CMD</span> dvc pull && dvc exp run</span></code></pre></div> <p>You would save this file and then run the following commands in your terminal.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">docker</span> build <span class="token parameter variable">-t</span> <span class="token string">"myproject-dvc-exp-run"</span> <span class="token builtin class-name">.</span> </span><span class="token line"><span class="token input">$ </span><span class="token command">docker</span> run myproject-dvc-exp-run</span></code></pre></div> <p>You could also use the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command or any of the other DVC commands.</p> <h3 id="how-can-i-reset-a-repository-and-start-fresh-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/970344379938127892" target="_blank" rel="nofollow noopener noreferrer">How can I reset a repository and start fresh with DVC?</a><a href="#how-can-i-reset-a-repository-and-start-fresh-with-dvc" aria-label="how can i reset a repository and start fresh with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Nice question from @strickvl!</p> <p>The best approach for resetting a repo is to run the <a 
href="https://dvc.org/doc/command-reference/destroy"><code>dvc destroy</code></a> command, which removes all DVC files and internals from your repository.</p> <h3 id="is-there-an-example-of-using-cml-with-gcp-that-can-be-used-as-a-reference" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/963512513452970086" target="_blank" rel="nofollow noopener noreferrer">Is there an example of using CML with GCP that can be used as a reference?</a><a href="#is-there-an-example-of-using-cml-with-gcp-that-can-be-used-as-a-reference" aria-label="is there an example of using cml with gcp that can be used as a reference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Excellent question from @sabygo!</p> <p>Here is a GitHub Actions snippet to get you started:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">jobs</span><span class="token punctuation">:</span> <span class="token key atrule">setup</span><span class="token punctuation">:</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> 
iterative/setup<span class="token punctuation">-</span>cml@v1 <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Deploy runner <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">GOOGLE_APPLICATION_CREDENTIALS_DATA</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GCP_CML_RUNNER_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> cml runner \ --single \ --labels=cml-gcp \ --token=${{ secrets.GCP_SECRET }} \ --cloud=gcp \ --cloud-region=us-west \ --cloud-type=e2-highcpu-2</span> <span class="token key atrule">test</span><span class="token punctuation">:</span> <span class="token key atrule">needs</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>setup<span class="token punctuation">]</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>gcp<span class="token punctuation">]</span> <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token comment"># - uses: iterative/setup-cml@v1</span> <span class="token punctuation">-</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> echo "model training"</span></code></pre></div> <h3 
id="can-i-use-preemptive-instances-provided-by-gcp-as-a-cml-runner" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/964860322710192202" target="_blank" rel="nofollow noopener noreferrer">Can I use preemptive instances provided by GCP as a <code>cml-runner</code>?</a><a href="#can-i-use-preemptive-instances-provided-by-gcp-as-a-cml-runner" aria-label="can i use preemptive instances provided by gcp as a cml runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Good question from @Atsu!</p> <p>Yes! You can use <code>cml runner --cloud-spot</code> to request a preemptive instance.</p> <hr> <p><img src="https://media.giphy.com/media/bg1MQ6IUVoVOM/giphy.gif" alt="We Did It Smiling GIF"></p> <p>At our June Office Hours Meetup, we will be hosting the launch party for our new MLOps tool! Make sure you join us to find out what it is! 
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/285789441/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/local-experiments-to-cloud-with-tpi-dockerhttps://dvc.org/blog/local-experiments-to-cloud-with-tpi-dockerTue, 24 May 2022 00:00:00 GMT<p>We recently <a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi">published a tutorial</a> on using <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">Terraform Provider Iterative (TPI)</a> to move a machine learning experiment from your local computer to a more powerful cloud machine. We've covered how you can use <a href="https://www.terraform.io" target="_blank" rel="nofollow noopener noreferrer">Terraform</a> & TPI to provision infrastructure, sync data, and run training scripts. To simplify the setup, we used a pre-configured <a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-image" target="_blank" rel="nofollow noopener noreferrer">Ubuntu/NVIDIA image</a>. However, instead of using a pre-configured image, we can use custom <a href="https://www.docker.com" target="_blank" rel="nofollow noopener noreferrer">Docker</a> images. These are often <a href="https://aws.amazon.com/blogs/opensource/why-use-docker-containers-for-machine-learning-development/" target="_blank" rel="nofollow noopener noreferrer">recommended in machine learning</a> as well as in traditional software development.</p> <admon type="info"> <p>Using Docker to manage dependencies (e.g. Python packages) does not remove all other setup requirements. 
You'll still need Docker itself installed, as well as GPU runtime drivers if applicable. Happily, TPI sets up all of this by default.</p> </admon> <p>When confronted with cloud infrastructure and dependencies, people often think "oh no, not again" (much <a href="https://www.youtube.com/watch?v=THSY7-CxKnQ" target="_blank" rel="nofollow noopener noreferrer">like the petunias</a> in the cover image). Separating dependencies into Docker images helps here: it gives more control over software versions and makes it painless to switch between the supported cloud providers, currently Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and Kubernetes. Your Docker image is cloud provider-agnostic. There are thousands of <a href="https://hub.docker.com/" target="_blank" rel="nofollow noopener noreferrer">pre-defined Docker images online</a> too.</p> <p>In this tutorial, we'll use an existing Docker image that comes with most of our requirements already installed. We'll then add a few more dependencies on top and run our training pipeline in the cloud as before!</p> <h2 id="run-gpu-enabled-docker-containers" style="position:relative;">Run GPU-enabled Docker containers<a href="#run-gpu-enabled-docker-containers" aria-label="run gpu enabled docker containers permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <admon type="warn"> <p>If you haven't read the <a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi">previous tutorial</a>, you should check out the basics there first. 
This includes how to let Terraform know about TPI, and essential commands (<code>init</code>, <code>apply</code>, <code>refresh</code>, <code>show</code>, and <code>destroy</code>).</p> </admon> <p>The only modification from the <a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi">previous tutorial</a> is the script part of the <code>main.tf</code> config file.</p> <p>Let's say we've found a carefully prepared Docker image suitable for data science and machine learning (in this case, <a href="https://cml.dev/doc/self-hosted-runners#docker-images" target="_blank" rel="nofollow noopener noreferrer"><code>iterativeai/cml:0-dvc2-base1-gpu</code></a>). This image comes loaded with Ubuntu 20.04, Python 3.8, NodeJS, CUDA 11.0.3, CuDNN 8, Git, <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a>, <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, and other essentials for full-stack data science.</p> <p>Our <code>script</code> block is now:</p> <div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token property">script</span> <span class="token punctuation">=</span> <span class="token heredoc string"><<-END
  #!/bin/bash
  docker run --gpus all -v "$PWD:/tpi" -w /tpi -e TF_CPP_MIN_LOG_LEVEL \
    iterativeai/cml:0-dvc2-base1-gpu /bin/bash -c "
  pip install -r requirements.txt tensorflow==2.8.0
  python train.py --output results-gpu/metrics.json
  "
END</span></code></pre></div> <p>Yes, it's quite long for a one-liner. 
Let's look at the components:</p> <ul> <li><code>docker run</code>: Download the specified image, create a container from the image, and run it.</li> <li><code>--gpus all</code>: Expose GPUs to the container.</li> <li><code>-v "$PWD:/tpi"</code>: Expose our current working directory (<code>$PWD</code>) within the container (as path <code>/tpi</code>).</li> <li><code>-w /tpi</code>: Set the working directory of the container (to be <code>/tpi</code>).</li> <li><code>-e TF_CPP_MIN_LOG_LEVEL</code>: Expose the environment variable <code>TF_CPP_MIN_LOG_LEVEL</code> to the container (in this case to control TensorFlow's verbosity).</li> <li><code>iterativeai/cml:0-dvc2-base1-gpu</code>: The image we want to download & run a container from.</li> <li><code>/bin/bash -c "pip install -r requirements.txt ... python train.py ..."</code>: Commands to run within the container's working directory. In this case, install the dependencies and run the training script.</li> </ul> <p>We can now call <code>terraform init</code>, <code>export TF_LOG_PROVIDER=INFO</code>, and <code>terraform apply</code> to provision infrastructure, upload our data and code, set up the cloud environment, and run the training process. If you'd like to tinker with this example you can <a href="https://github.com/iterative/blog-tpi-bees/tree/docker" target="_blank" rel="nofollow noopener noreferrer">find it on GitHub</a>.</p> <admon type="tip"> <p>Don't forget to <code>terraform refresh && terraform show</code> to check the status, and <code>terraform destroy</code> to download results & shut everything down.</p> </admon> <p>Now you know the basics of using convenient Docker images together with <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">TPI</a> for provisioning your MLOps infrastructure!</p> <admon type="tip"> <p>If you have a lot of custom dependencies that rarely change (e.g. 
a large <code>requirements.txt</code> that is rarely updated), it's a good idea to build it into your own custom Docker image. Let us know if you'd like a tutorial on this!</p> </admon>https://dvc.org/blog/may-22-heartbeathttps://dvc.org/blog/may-22-heartbeatMon, 16 May 2022 00:00:00 GMT<h1 id="aiml-news" style="position:relative;">AI/ML News<a href="#aiml-news" aria-label="aiml news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 200px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9638ba164ff65ee833ba12eb47f5694d/0988f/chip-huyen.jpg" alt="Designing Machine Learning Systems" title="Designing Machine Learning Systems" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h2 id="chip-huyen-designing-machine-learning-systems" style="position:relative;">Chip Huyen: Designing Machine Learning Systems<a href="#chip-huyen-designing-machine-learning-systems" aria-label="chip huyen designing machine learning systems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a 
href="https://www.linkedin.com/in/chiphuyen/" target="_blank" rel="nofollow noopener noreferrer"><strong>Chip Huyen</strong></a> just came out with a new book with <a href="https://oreilly.com" target="_blank" rel="nofollow noopener noreferrer">O'Reilly</a> entitled <a href="https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/" target="_blank" rel="nofollow noopener noreferrer">Designing Machine Learning Systems</a>.<br> I'm not going to pontificate here; Chip Huyen wrote it, the reviews are glowing, need I say more?</p> <h2 id="jenny-abramov-an-agile-framework-for-ai-projects--development-qa-deployment-and-maintenance" style="position:relative;">Jenny Abramov: An Agile Framework for AI Projects — Development, QA, Deployment and Maintenance<a href="#jenny-abramov-an-agile-framework-for-ai-projects--development-qa-deployment-and-maintenance" aria-label="jenny abramov an agile framework for ai projects development qa deployment and maintenance permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/jennyabramov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jenny Abramov</strong></a> <a href="https://towardsdatascience.com/an-agile-framework-for-ai-projects-development-cbe115ba86a2" target="_blank" rel="nofollow noopener noreferrer">wrote a piece</a> in <a href="https://towardsdatascience.com/" target="_blank" rel="nofollow noopener noreferrer">Towards Data Science</a> presenting an "iterative-lifecycle framework" adapted to AI-centered software. 
She outlines important considerations for working through the framework, which depend on your use case, data, and business problem.</p> <p>She suggests using DVC for your larger, more complex datasets and also discusses the need for reproducibility in experimentation, which DVC can help with <a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">(see Technical Product Manager Dave Berenbaum’s post on experiment versioning)</a>.</p> <p>In addition, she discusses issues with quality assurance in deployment and the maintenance of the system.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ade289bcbc573399c5dd5db19fc2d749/39600/jenny-abramov.png" alt="Jenny Abramov iterative-lifecycle framework" title="Jenny Abramov iterative-lifecycle framework" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Jenny Abramov's iterative-lifecycle framework (<a href="https://towardsdatascience.com/an-agile-framework-for-ai-projects-development-cbe115ba86a2" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="mlops-guide-from-innoq" style="position:relative;">MLOps Guide from INNOQ<a href="#mlops-guide-from-innoq" aria-label="mlops guide from innoq permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/larysavisenger/" target="_blank" rel="nofollow noopener 
noreferrer"><strong>Dr. Larysa Visengeriyeva</strong></a>, <a href="https://www.linkedin.com/in/anja-kammer-berlin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Anja Kammer,</strong></a> <a href="https://www.linkedin.com/in/isabel-b%C3%A4r-a89705213/" target="_blank" rel="nofollow noopener noreferrer"><strong>Isabel Bär,</strong></a> <a href="https://www.linkedin.com/in/alexander-kniesz-656256197/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alexander Kniesz,</strong></a> and <a href="https://www.linkedin.com/in/michael-ploed/" target="_blank" rel="nofollow noopener noreferrer"><strong>Michael Plöd</strong></a> of <a href="https://www.innoq.com/en/" target="_blank" rel="nofollow noopener noreferrer"><strong>INNOQ</strong></a> (a software development, strategy, and technology consultancy) created <a href="https://ml-ops.org/content/mlops-principles" target="_blank" rel="nofollow noopener noreferrer">this</a> very thorough resource on MLOps, going through all the principles and "iterative-incremental" steps of the process (there's an iterative pattern here 😉). The authors cover Automation, Continuous X (hello CML and TPI), Versioning (hello DVC!), Experiments Tracking (noted DVC here because indeed DVC does experiment tracking too!), Testing, Monitoring, the "ML Test Score" System, Reproducibility, Modularity, ML-based Software Delivery Metrics, and MLOps Principles and Best Practices. 
Definitely a good resource for MLOps and filled with more resources as well.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0eb187e7f2573e5f7cb8160fb74297e1/03346/innoq.jpg" alt="INNOQ MLOps Guide" title="INNOQ MLOps Guide" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>INNOQ MLOps Guide (<a href="https://ml-ops.org/content/mlops-principles" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>Also interesting from INNOQ is their <a href="https://www.innoq.com/en/artists/" target="_blank" rel="nofollow noopener noreferrer">Artist-in-residence program</a>, created because they "believe in the conscious reflection between technology and society" and feel art is well suited for this reflection. See the work below by Studio Waltz Binaire based on the question: What traces do we leave behind with technology?</p> <p><img src="https://media.giphy.com/media/NxdrJ6a4IQKyW5gGjL/giphy.gif" alt="Waltz Binaire GIF"></p> <p>(<a href="https://www.innoq.com/en/artists/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</p> <h2 id="laszlo-sragner-linkedin-discussion-on-code-quality" style="position:relative;">Laszlo Sragner: LinkedIn discussion on Code Quality<a href="#laszlo-sragner-linkedin-discussion-on-code-quality" aria-label="laszlo sragner linkedin discussion on code quality permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> 
</a></h2> <p><a href="https://www.linkedin.com/in/laszlosragner/?trk=public_post-embed_share-update_actor-text&originalSubdomain=uk" target="_blank" rel="nofollow noopener noreferrer"><strong>Laszlo Sragner</strong></a>, a frequent contributor to the MLOps Community in general who often drives discussions and helps others in the <a href="https://mlops-community.slack.com/join/shared_invite/zt-178s99cyv-Q~whRpqbhgMTBrOcbjnDIQ#/shared-invite/email" target="_blank" rel="nofollow noopener noreferrer">MLOps Community Slack channel,</a> posed an interesting point about code quality on LinkedIn. Join the discussion and weigh in at this post:</p> <div class="gatsby-resp-iframe-wrapper" style="padding-bottom: 158.73015873015873%; position: relative; height: 0; overflow: hidden; "> <iframe src="https://www.linkedin.com/embed/feed/update/urn:li:share:6931541880090324992" frameborder="0" allowfullscreen title="Embedded post" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> <h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="icymi-we-released-tpi-" style="position:relative;">ICYMI: We released TPI! 
🎉<a href="#icymi-we-released-tpi-" aria-label="icymi we released tpi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>On April 27th we released the latest offering to our tool ecosystem.</p> <p><img src="https://media.giphy.com/media/ut7lqhIfOscbjuU6YQ/giphy.gif" alt="Celebrate GIF"></p> <p><a href="https://tpi.cml.dev" target="_blank" rel="nofollow noopener noreferrer">Terraform Provider Iterative (TPI)</a> is a Terraform plugin built with machine learning in mind. Full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)… without needing to be a cloud expert.</p> <ul> <li> <p><strong>Lower cost with spot recovery:</strong> transparent data checkpoint/restore & auto-respawning of low-cost spot/preemptible instances</p> </li> <li> <p><strong>No cloud vendor lock-in:</strong> switch between clouds with just one line thanks to unified abstraction</p> </li> <li> <p><strong>No waste:</strong> auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use</p> </li> <li> <p><strong>Developer-first experience:</strong> one-command data sync & code execution with no external server, making the cloud feel like a laptop</p> </li> <li> <p>⭐️ <a href="https://tpi.cml.dev" target="_blank" rel="nofollow noopener noreferrer">Star the Repo</a></p> </li> <li> <p>✍🏼 <a href="https://dvc.org/blog/terraform-provider" target="_blank" rel="nofollow noopener noreferrer">Read the release blog 
post</a></p> </li> <li> <p>⚙️ Read: <a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi" target="_blank" rel="nofollow noopener noreferrer">Moving Local Experiments to the Cloud with Terraform Provider Iterative (TPI) tutorial</a></p> </li> <li> <p>🎥 <a href="https://www.youtube.com/watch?v=2fEgO8SazSE&t=2s" target="_blank" rel="nofollow noopener noreferrer">Watch the video</a></p> </li> <li> <p>🪐 <a href="https://github.com/iterative/blog-tpi-jupyter" target="_blank" rel="nofollow noopener noreferrer">TPI with Jupyter Notebooks Repo</a></p> </li> </ul> <p>Stay tuned for more tutorials and use cases to come!</p> <p><img src="https://media.giphy.com/media/MrCYIN3x0SgdG/giphy.gif" alt="Tom Cruise GIF"></p> <h2 id="mission-impossible---we-have-a-mission-statement" style="position:relative;">🚀<del>Mission Impossible</del> - We have a mission statement!<a href="#mission-impossible---we-have-a-mission-statement" aria-label="mission impossible we have a mission statement permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We did it! This year we surveyed the entire team to arrive at a mission statement for Iterative. It was no small feat to decide on what it should be given the early stage of our industry, the variety of our tools, and always a struggle - figuring out the best and most concise way to convey these ideas (you know our penchant for abbreviations). But we persevered and succeeded. 
Behold Iterative's new mission statement:</p> <blockquote> <p>We deliver the best developer experience for machine learning teams by creating an ecosystem of open, modular ML tools.</p> </blockquote> <p>As always the door is open for your feedback on how we can serve your needs better!</p> <h2 id="odsc-east" style="position:relative;">ODSC East<a href="#odsc-east" aria-label="odsc east permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We attended our first post-pandemic, in-person conference in Boston last month. It was awesome to be together as a team, see <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a>, <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a>, and <a href="https://www.linkedin.com/in/alex000kim/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> in action, and talk to attendees and other vendors at the conference. 
We are looking forward to <a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> next month!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2dc474d99a4acbf874536ef6c50f8403/03346/odsc.jpg" alt="Iterative Team at ODSC East" title="Iterative Team at ODSC East" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative team (left to right) - Mike Moynihan, me, Dave Berenbaum, Daniel Barnes, (DeeVee), Rob De Wit, Milecia McGregor, Dmitry Petrov, Jervis Hui, Alex Kim, Chaz Black</em></p> <h2 id="-tons-of-new-content-on-the-blog" style="position:relative;">✍🏼 Tons of new content on the blog<a href="#-tons-of-new-content-on-the-blog" aria-label=" tons of new content on the blog permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our team has been on fire creating content for you. 🔥 Don't miss the following:</p> <ul> <li>Need to get started with CML and AWS? 
<a href="https://www.linkedin.com/in/rcdewit?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAA5CEPkB9fI02IpClBKhRdq2brULPHMhmR8&lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3B9MrcxBhQSG6IKzSgJDyfQQ%3D%3D" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob de Wit</strong></a> shows you how to train and save your models with CML in a two-part series using a <a href="https://dvc.org/blog/CML-runners-saving-models-1" target="_blank" rel="nofollow noopener noreferrer">self-hosted AWS EC2 runner</a> and <a href="https://dvc.org/blog/CML-runners-saving-models-2" target="_blank" rel="nofollow noopener noreferrer">with CML and DVC on a dedicated AWS EC2 runner</a></li> <li>The <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines" target="_blank" rel="nofollow noopener noreferrer">Part 1</a>, <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments" target="_blank" rel="nofollow noopener noreferrer">Part 2</a> and <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-3-remote-exp-ci-cd" target="_blank" rel="nofollow noopener noreferrer">Part 3</a> tutorials of <a href="https://www.linkedin.com/in/alex000kim/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim's</strong></a> End-to-End Computer Vision API project are out and filled with great learning!</li> <li><a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> brings the monthly roundup of the Community's best technical questions in our latest <a href="https://dvc.org/blog/april-22-community-gems" target="_blank" rel="nofollow noopener noreferrer">Community Gems.</a> 💎</li> </ul> <h2 id="-shiny-new-docs" style="position:relative;">✨ Shiny New Docs<a href="#-shiny-new-docs" aria-label=" shiny new docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 
0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have a <a href="https://dvc.org/doc/start/experiments/visualization" target="_blank" rel="nofollow noopener noreferrer">new doc page</a> showcasing the new visualizations added to the <a href="https://github.com/iterative/example-dvc-experiments" target="_blank" rel="nofollow noopener noreferrer">example-dvc-experiments repo</a>.<br> Whether you need plots from tabular data, user-generated plots, or autogenerated plots from deep learning code, we've got you covered.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d74bcfcd5fcdcca1aa86f048846e2334/39600/dvc-visualization-doc.png" alt="DVC Visualization Doc" title="DVC Visualization Doc" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC Visualization Doc (<a href="https://towardsdatascience.com/an-agile-framework-for-ai-projects-development-cbe115ba86a2" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="dmitry-petrov-on-tfir-about-terraform-provider-iterative-tpi" style="position:relative;">Dmitry Petrov on TFIR about Terraform Provider Iterative (TPI)<a href="#dmitry-petrov-on-tfir-about-terraform-provider-iterative-tpi" aria-label="dmitry petrov on tfir about terraform provider iterative tpi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 
4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> recently sat down with <a href="https://twitter.com/SwapBhartiya" target="_blank" rel="nofollow noopener noreferrer"><strong>Swapnil Bhartiya</strong></a> of <a href="https://www.tfir.io/" target="_blank" rel="nofollow noopener noreferrer">TFIR</a> to have a chat about TPI. Learn how to save your team valuable resources in your machine learning projects with Terraform Provider Iterative (TPI). You can watch the recording below.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/x-xiKzlQFjY?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="-join-our-release-party-meetup" style="position:relative;">🥳 Join our Release Party Meetup<a href="#-join-our-release-party-meetup" aria-label=" join our release party meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 
1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have another tool ready to debut on May 24th. On the 25th we'd love to have you join us for a Release Party Meetup. We will be celebrating the release of the new addition to our open-source tool ecosystem and have a demo of said tool! To join the fun, <a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/285789441/" target="_blank" rel="nofollow noopener noreferrer">RSVP to the Meetup </a> and mark your calendar!</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/285789441/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">New Tool Release Party</h4> <div class="elp-description">Join us May 25th. RSVP for New Tool Release Party!</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-05-16/release-party-meetup-c6da47fce4ec95bd0414fe06e5e45c0c.png" alt="New Tool Release Party"> </div> </a> </section> <p></p> <h2 id="new-hires" style="position:relative;">New hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href=""><strong>Wolmir Nemitz</strong></a> is our first team member from South America! 
We're getting closer to covering all the continents on <a href="https://iterative.ai/about" target="_blank" rel="nofollow noopener noreferrer">our remote team map</a>! From Brazil, Wolmir joins us as an Engineer for the 🤫 team (you'll find out June 14th). Wolmir has four dogs, two tortoises, and a budgie! 🦜</p> <p><a href="https://www.linkedin.com/in/ufijuice/" target="_blank" rel="nofollow noopener noreferrer"><strong>Pavel Chekmaryov</strong></a> joins us in People Operations, managing the hiring pipeline from Frankfurt, Germany, but soon to be Canada! He has spent the last eight years in startups, most recently at OccurAI, reinventing recruitment in the deep-tech/ML field. We look forward to him helping to grow our amazing team!</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Even with our amazing new additions to the team, we're still hiring! <a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions and share with anyone you think may be interested! 
🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative is Hiring (<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h1 id="community-news" style="position:relative;">Community News<a href="#community-news" aria-label="community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="yet-another-tool-comparison-imagine-that" style="position:relative;">Yet another tool comparison, imagine that!<a href="#yet-another-tool-comparison-imagine-that" aria-label="yet another tool comparison imagine that permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><img src="https://media.giphy.com/media/lWa7aAo62YZLwtk3nj/giphy.gif" alt="Cant Believe There You Are GIF"></p> 
<p>So each month I tell you about yet another post to help you attempt to make sense of the vast MLOps tool space. Well, this month is no different. I mean you could be new here, right? 🤷🏽‍♀️ <a href="https://dolthub.com" target="_blank" rel="nofollow noopener noreferrer">DoltHub</a> tries to bring some clarity <a href="https://www.dolthub.com/blog/2022-04-27-data-version-control/" target="_blank" rel="nofollow noopener noreferrer">with this piece</a> by comparing different data versioning tools and the intricacies of each. You do your research. You know we're partial.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/927c16986aa3b3e5f268930bb780460d/39600/data-version-control.png" alt="Data Version Control tools" title="Data Version Control tools" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Data Version Control tools (<a href="https://ml-ops.org/content/mlops-principles" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>I’m starting to wonder if all Data Science/AI teams need a role whose sole responsibility is keeping up to date with all the new tooling, and the changes to existing tooling, in the MLOps space, and working out what might work best for the team. What should this position be called? The best answer wins a DVC t-shirt. See <a href="https://twitter.com/DVCorg/status/1526286089551433728?s=20&t=nV3FQAso441MtvrckYAOJA" target="_blank" rel="nofollow noopener noreferrer">this Twitter thread</a> to answer. (Hint: Funny answers will likely win 😉). Deadline: May 31st. Pass it around…</p> <h2 id="andrey-cheptsov-notebooks-and-mlops-choose-one" style="position:relative;">Andrey Cheptsov: Notebooks and MLOps. 
Choose One.<a href="#andrey-cheptsov-notebooks-and-mlops-choose-one" aria-label="andrey cheptsov notebooks and mlops choose one permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/andrey-cheptsov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Andrey Cheptsov</strong></a> writes <a href="https://mlopsfluff.dstack.ai/p/notebooks-and-mlops-choose-one?s=r" target="_blank" rel="nofollow noopener noreferrer">a piece</a> pointing out how Jupyter Notebooks, while rightfully loved in data science work, fail pretty miserably in a production environment and the reliance on them can cause bad habits. He notes that he's found:</p> <blockquote> <p>For any ML model, the time spent in a Jupyter notebook is inversely proportional to its reproducibility. The reasons behind this rule are poor modularity and reusability of the code in notebooks, and poor integration with Git. - Andrey Cheptsov</p> </blockquote> <p>He advocates for training your models using Python scripts, Git, and CI/CD to automatically shift your focus to creating reusable, testable code, and to use tools like <a href="https://gradio.app/" target="_blank" rel="nofollow noopener noreferrer">Gradio</a> and <a href="https://streamlit.io/" target="_blank" rel="nofollow noopener noreferrer">Streamlit</a> to provide the interactivity of Jupyter notebooks. Sounds like a promising idea. 
💡</p> <p><img src="https://media.giphy.com/media/qxtxlL4sFFle/giphy.gif" alt="Confused The Interview GIF"></p> <h2 id="beyond-ml" style="position:relative;">Beyond ML<a href="#beyond-ml" aria-label="beyond ml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As noted above in our shiny new mission statement, our focus is to make tools for machine learning teams. It has however come to our attention that more and more users are using our tools for non-ML use cases.</p> <p><a href="https://drorspei.wordpress.com/about/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dror Speiser</strong></a> writes about a non-ML use case in <a href="https://drorspei.wordpress.com/2021/09/15/a-new-recipe-for-reproducible-cloud-deployments/" target="_blank" rel="nofollow noopener noreferrer">A New Recipe for Idempotent Cloud Deployments</a> in which he provides a tutorial for doing just that with DVC.</p> <p>The benefits of the approach are:</p> <blockquote> <ol> <li>Changing one artifact’s code does not force rebuilding other artifacts, even if you’re building on a new VM every time.</li> <li>Changing only the deployment script won’t build any artifacts at all.</li> <li>You have an artifact repository that just works.</li> <li>Your Git history contains the hashes of all built artifacts.</li> <li>You can look up any artifact using its hash.</li> </ol> </blockquote> <p>We have opened up a #beyond-ml channel in our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord Server</a>. 
Do stop by and chat about alternate uses for our tools!</p> <h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li>📣 Our next in-person conference will be <a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> from June 7-10 in Toronto! We look forward to seeing Community members there!</li> <li>📣 PyLadies Berlin is hosting <strong>Doreen</strong>, a data scientist working at <a href="https://opinary.com/" target="_blank" rel="nofollow noopener noreferrer">Opinary</a>, who will be presenting "Reproducible Machine Learning with DVC and Poetry" on May 17th. 
<a href="https://www.meetup.com/PyLadies-Berlin/events/285313817/" target="_blank" rel="nofollow noopener noreferrer">Join the event here.</a></li> <li>📣 <a href="https://www.linkedin.com/in/nicolas-eiris/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nicolás Eiras</strong></a> will be presenting "Data Versioning: Towards Reproducibility in Machine Learning" at <a href="https://embeddedvisionsummit.com/2022/session/data-versioning-towards-reproducibility-in-machine-learning/" target="_blank" rel="nofollow noopener noreferrer">Embedded Vision Summit</a> on May 18th in Santa Clara, California.</li> <li>📣 <a href="https://www.meetup.com/PyData-MTL/" target="_blank" rel="nofollow noopener noreferrer">Montreal PyData</a> will host a <a href="https://www.meetup.com/PyData-MTL/events/285894672/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a> on June 16th with two presentations, "Introduction to Trustworthy Machine Learning for the Enterprise" by <a href="https://www.linkedin.com/in/mohamedleila/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mohamed Leila</strong></a>, ServiceNow and "ML in production in the video game industry: Ubisoft's use case" by <a href="https://www.linkedin.com/in/jeanmicheldaignan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jean-Michel Daignan</strong></a>, Ubisoft</li> </ul> <h2 id="other-fun-stuff" style="position:relative;">Other Fun Stuff<a href="#other-fun-stuff" aria-label="other fun stuff permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a 
href="https://github.com/gaocegege/awesome-open-source-mlops" target="_blank" rel="nofollow noopener noreferrer">New Awesome list</a></li> <li><a href="https://www.udemy.com/course/dvc-and-git-for-data-science/" target="_blank" rel="nofollow noopener noreferrer">New Udemy Course including DVC</a> (But don't forget <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">our online course!</a>)</li> <li>Would you like to get some good practice in? Join this <a href="https://www.the-odd-dataguy.com/2022/04/28/dvc_kaggle/" target="_blank" rel="nofollow noopener noreferrer">Kaggle competition</a> created by <a href="https://www.linkedin.com/in/jeanmicheldaignan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jean-Michel Daignan</strong></a> based on a previous competition from Petfinder.my with some really cute pet images.</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1118538f2983017664997ed3c565e62d/39600/img_pawpularity.png" alt="DVC Kaggle Competition" title="DVC Kaggle Competition" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC Kaggle Competition based on Petfinder.my (<a href="https://www.the-odd-dataguy.com/2022/04/28/dvc_kaggle/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 
13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We love it when our Community does conference talks on our tools! 🥰</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">The <a href="https://twitter.com/EmbVisionSummit">@EmbVisionSummit</a> starts on Monday and our team is on its way!🚀<br><br>We’ve had our fair share of experience on edge devices. Nicolás and our CTO <a href="https://twitter.com/dekked_">@dekked_</a> will be there; come by to chat about our experiences.<br><br>Also, don't miss Nico's talk! May 18th - 2:05pm <a href="https://t.co/MfnEtOT29Y">https://t.co/MfnEtOT29Y</a> <a href="https://t.co/r9itWhVjis">pic.twitter.com/r9itWhVjis</a></p>— Tryolabs (@tryolabs) <a href="https://twitter.com/tryolabs/status/1525103969885888512">May 13, 2022</a></blockquote> <p>This Heartbeat was brought to you by the song "Tarkus" from Emerson, Lake, and Palmer which can be found on our <a href="https://open.spotify.com/playlist/3eahsf3T9iEJkfWECC7VEp?si=cbcf1f9d3e424d62" target="_blank" rel="nofollow noopener noreferrer">MLOps Playlist,</a> and the letters <strong>T, P, and I.</strong> 😉 See you next month!</p> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/local-experiments-to-cloud-with-tpihttps://dvc.org/blog/local-experiments-to-cloud-with-tpiThu, 12 May 2022 00:00:00 GMT<p>There are many reasons you might train a machine learning model locally. Mainly, it's quick & easy to set up a new project on a local machine. This is sufficient for simple experiments (with reduced data subsets or small models) without paying to rent heavy cloud compute resources. A local machine is also deeply familiar — as opposed to the multitude of available cloud services, which can be intimidating even with a decent background in DevOps.</p> <p>Once you locally set up and iterate over your data & code enough, you may reach a point where more powerful compute resources are needed to train a larger model and/or use bigger datasets. In other words, you might have to switch from experimenting locally to a cloud environment. 
If you find yourself in this situation, this tutorial will help you easily provision cloud infrastructure with Terraform and run your existing training script on it.</p> <h2 id="getting-started" style="position:relative;">Getting Started<a href="#getting-started" aria-label="getting started permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This tutorial uses the <a href="https://www.kaggle.com/jenny18/honey-bee-annotated-images" target="_blank" rel="nofollow noopener noreferrer">BeeImage Dataset</a> which contains over 5,100 bee images annotated with location, date, time, subspecies, health condition, caste, and pollen. Let's assume we've downloaded the images, created a project, and trained a <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" target="_blank" rel="nofollow noopener noreferrer">convolutional neural network</a> model locally to classify different subspecies. If you want to follow along, you can use your own data and training code, or clone <a href="https://github.com/iterative/blog-tpi-bees" target="_blank" rel="nofollow noopener noreferrer">the example repository</a>.</p> <p>How do we continue iterating on our model in the cloud? Can we run more epochs overnight? Change some hyperparameters? Add more layers? The first question when planning <em>The Big Move</em> is "what dependencies are needed to train this model in a cloud environment?"</p> <p>Some of the important puzzle pieces you already have locally:</p> <ul> <li>Your training code. 
It is likely that you have a <a href="https://dvc.org/doc/start/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">whole pipeline</a> with multiple stages but for the sake of simplicity, this tutorial uses a single <code>train.py</code> script.</li> <li>Data.</li> <li>Python environment with all required libraries.</li> </ul> <p>You will also need an account with your cloud provider of choice. In this tutorial we'll be provisioning infrastructure on <a href="https://aws.amazon.com/" target="_blank" rel="nofollow noopener noreferrer">Amazon Web Services (AWS)</a>. You can create an AWS account yourself, or ask your DevOps team to provide you with one.</p> <admon type="info"> <p>Make sure to insert <a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#amazon-web-services" target="_blank" rel="nofollow noopener noreferrer">authentication credentials</a> into your system's environment variables (<code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code>).</p> </admon> <p>We can now start the move with the help of <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">Terraform Provider Iterative (TPI)</a>.</p> <h2 id="what-is-terraform" style="position:relative;">What is Terraform?<a href="#what-is-terraform" aria-label="what is terraform permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <admon type="info"> <p><a href="https://www.terraform.io" target="_blank" rel="nofollow noopener noreferrer">Terraform</a> is an 
open-source infrastructure-as-code tool that you should <a href="https://www.terraform.io/downloads" target="_blank" rel="nofollow noopener noreferrer">download and install</a> for this tutorial.</p> </admon> <p>Terraform requires us to create a configuration file containing a declarative description of the infrastructure we need. There's no need to read lots of cloud documentation or write lots of commands. Instead, you describe what your infrastructure should ultimately look like. Behind the scenes, Terraform will figure out what needs to be done. If you've cloned the <a href="https://github.com/iterative/blog-tpi-bees" target="_blank" rel="nofollow noopener noreferrer">repository</a>, the <code>main.tf</code> configuration file is in the project's root. We'll explain its contents below.</p> <h2 id="terraform-provider-iterative-tpi" style="position:relative;">Terraform Provider Iterative (TPI)<a href="#terraform-provider-iterative-tpi" aria-label="terraform provider iterative tpi permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Terraform can orchestrate a plethora of resources for you, but for the majority of projects you only need a few. 
Instead of shipping plugins (providers) for all these resources in one bundle, Terraform downloads <a href="https://www.terraform.io/docs/extend/how-terraform-works.html" target="_blank" rel="nofollow noopener noreferrer"><em>providers</em></a> whenever required.</p> <p>For this tutorial we will only need <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">TPI</a>. It enables full lifecycle management of computing resources from AWS, Microsoft Azure, Google Cloud Platform, and Kubernetes. TPI provisions infrastructure, syncs data, and also executes your scripts, all via a single configuration file. It has several super neat features:</p> <ul> <li>The configuration for different cloud compute providers is nearly identical, so you can easily migrate from one cloud provider to another.</li> <li>It syncs data to and from the remote cloud and your local machine.</li> <li>It allows you to use low-cost spot instances, and even automatically respawns interrupted instances, restoring working directories/data and resuming running tasks in the cloud even if you are offline.</li> <li>Once your training is complete, the remote resources will be terminated, avoiding unused machines quietly ramping up costs.</li> </ul> <p>To start using TPI we need to let Terraform know about it by writing this in our <code>main.tf</code>:</p> <div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token keyword">terraform</span> <span class="token punctuation">{</span> <span class="token keyword">required_providers</span> <span class="token punctuation">{</span> <span class="token property">iterative</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">source</span> <span class="token punctuation">=</span> <span class="token string">"iterative/iterative"</span> <span class="token punctuation">}</span> 
<span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token keyword">provider<span class="token type variable"> "iterative" </span></span><span class="token punctuation">{</span><span class="token punctuation">}</span></code></pre></div> <p>Once we describe what providers we need, run</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> init</span></code></pre></div> <admon type="info"> <p>If you have cloned the example repository, you should run this command before doing anything else. This will initialize your working directory and download the required provider(s).</p> </admon> <admon type="tip"> <p>It's probably also a good idea to set the logging level to see helpful info on progress:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span class="token assign-left variable">TF_LOG_PROVIDER</span><span class="token operator">=</span>INFO</span></code></pre></div> </admon> <h2 id="configuring-iterative_task" style="position:relative;">Configuring <code>iterative_task</code><a href="#configuring-iterative_task" aria-label="configuring iterative_task permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>TPI offers a single resource called <code>iterative_task</code> that we'll need to configure. 
This resource will:</p> <ol> <li>Create cloud resources (storage, machines) for the task.</li> <li>If specified, upload a local working directory to the cloud storage.</li> <li>Run the given script in the cloud until completion, error, or timeout.</li> <li>If specified, download output results.</li> <li>Automatically terminate compute resources upon task completion.</li> </ol> <p>This is exactly what we need to run a model training process! Let's see the <code>iterative_task</code> in the <code>main.tf</code> file before delving into the details:</p> <div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token keyword">terraform</span> <span class="token punctuation">{</span> <span class="token keyword">required_providers</span> <span class="token punctuation">{</span> <span class="token property">iterative</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">source</span> <span class="token punctuation">=</span> <span class="token string">"iterative/iterative"</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token keyword">provider<span class="token type variable"> "iterative" </span></span><span class="token punctuation">{</span><span class="token punctuation">}</span> <span class="token keyword">resource <span class="token type variable">"iterative_task"</span></span> <span class="token string">"example-basic"</span> <span class="token punctuation">{</span> <span class="token property">cloud</span> <span class="token punctuation">=</span> <span class="token string">"aws"</span> <span class="token comment"># or any of: gcp, az, k8s</span> <span class="token property">machine</span> <span class="token punctuation">=</span> <span class="token string">"m"</span> <span class="token comment"># medium. 
Or any of: l, xl, m+k80, xl+v100, ...</span> <span class="token property">spot</span> <span class="token punctuation">=</span> <span class="token number">0</span> <span class="token comment"># auto-price. Default -1 to disable, or >0 for hourly USD limit</span> <span class="token property">timeout</span> <span class="token punctuation">=</span> <span class="token number">24</span>*<span class="token number">60</span>*<span class="token number">60</span> <span class="token comment"># 24h</span> <span class="token property">image</span> <span class="token punctuation">=</span> <span class="token string">"ubuntu"</span> <span class="token keyword">storage</span> <span class="token punctuation">{</span> <span class="token property">workdir</span> <span class="token punctuation">=</span> <span class="token string">"src"</span> <span class="token property">output</span> <span class="token punctuation">=</span> <span class="token string">"results-basic"</span> <span class="token punctuation">}</span> <span class="token property">environment</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">TF_CPP_MIN_LOG_LEVEL</span> <span class="token punctuation">=</span> <span class="token string">"1"</span> <span class="token punctuation">}</span> <span class="token property">script</span> <span class="token punctuation">=</span> <span class="token heredoc string"><<-END #!/bin/bash sudo apt-get update -q sudo apt-get install -yq python3-pip pip3 install -r requirements.txt tensorflow-cpu==2.8.0 python3 train.py --output results-basic/metrics.json END</span> <span class="token punctuation">}</span></code></pre></div> <p>Every Terraform resource needs a name; here it's <code>example-basic</code>. This name is only used within the configuration file and it can be whatever you want. 
Inside of the resource block, we specify some arguments:</p> <ul> <li><em>cloud</em> (<strong>required</strong>): cloud provider to run the task on. This can be <code>aws</code>, <code>gcp</code>, <code>az</code>, or <code>k8s</code>.</li> <li><em>machine</em>: if you know the exact kind of machine that you'd like to use, you can specify it here. Alternatively, <a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type" target="_blank" rel="nofollow noopener noreferrer">TPI offers some common machine types</a> which are roughly the same for all supported clouds. For example, <code>m+t4</code> means "Medium, with (at least) 4 CPU cores, 16 GB RAM, and 1 NVIDIA Tesla T4 GPU device".</li> <li><em>spot</em>: set the <a href="https://aws.amazon.com/ec2/spot/pricing/" target="_blank" rel="nofollow noopener noreferrer">spot instance price</a>. Here we use <code>0</code> for automatic pricing, which should keep costs down. Alternatively you can specify a positive number to set a maximum bidding price in USD per hour, or <code>-1</code> to use on-demand pricing.</li> <li><em>timeout</em>: maximum time to run (in seconds) before the instance is force-terminated. This prevents forgotten long-running instances draining your budget.</li> <li><em>image</em>: the machine image to use (in our case, Ubuntu LTS 20.04).</li> <li><em>workdir</em>: a directory on your local machine, relative to your project folder, which you would like to upload to the remote machine. This way you can share your whole project or parts of it with a remote machine.</li> <li><em>output</em>: a directory <strong>relative to <code>workdir</code></strong> to download after the task is complete.</li> <li><em>script</em> (<strong>required</strong>): this is where TPI's magic happens, i.e. 
what commands to run in <code>workdir</code> on the provisioned cloud instance.</li> </ul> <admon type="tip"> <p>See the <a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#argument-reference" target="_blank" rel="nofollow noopener noreferrer">resource arguments documentation</a> for a full list.</p> </admon> <admon type="warn"> <p>Keep in mind <a href="https://aws.amazon.com/ec2/pricing/" target="_blank" rel="nofollow noopener noreferrer">the running costs of AWS EC2 instances</a>. The <code>machine</code> used in the example above is not included in the free tier and will incur charges. Using TPI's <code>spot</code> pricing will keep costs to a minimum (roughly $0.15/hour for <code>m+t4</code> on AWS), but will not eliminate them entirely.</p> </admon> <p>In the simplest scenario, all we need to do on a new machine to run the training <code>script</code> is to set up the Python environment with required libraries. If you simply want to train your model on a machine with more memory, this may be enough. However, if you want your training code to leverage GPUs, we can make a few small tweaks:</p> <h2 id="training-with-gpu" style="position:relative;">Training with GPU<a href="#training-with-gpu" aria-label="training with gpu permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are several ways you can leverage GPU devices on a remote machine. 
You can install all the required drivers and dependencies "manually" via a script, you can use an existing Docker image, build your own, or just use the convenient <code>nvidia</code> image pre-packaged with CUDA 11.3 GPU drivers.</p> <div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token keyword">terraform</span> <span class="token punctuation">{</span> <span class="token keyword">required_providers</span> <span class="token punctuation">{</span> <span class="token property">iterative</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">source</span> <span class="token punctuation">=</span> <span class="token string">"iterative/iterative"</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span> <span class="token keyword">provider<span class="token type variable"> "iterative" </span></span><span class="token punctuation">{</span><span class="token punctuation">}</span> <span class="token keyword">resource <span class="token type variable">"iterative_task"</span></span> <span class="token string">"example-gpu"</span> <span class="token punctuation">{</span> <span class="token property">cloud</span> <span class="token punctuation">=</span> <span class="token string">"aws"</span> <span class="token property">machine</span> <span class="token punctuation">=</span> <span class="token string">"m+t4"</span> <span class="token comment"># 4 CPUs and an NVIDIA Tesla T4 GPU</span> <span class="token property">spot</span> <span class="token punctuation">=</span> <span class="token number">0</span> <span class="token property">timeout</span> <span class="token punctuation">=</span> <span class="token number">24</span>*<span class="token number">60</span>*<span class="token number">60</span> <span class="token property">image</span> <span class="token punctuation">=</span> 
<span class="token string">"nvidia"</span> <span class="token comment"># has CUDA GPU drivers</span> <span class="token keyword">storage</span> <span class="token punctuation">{</span> <span class="token property">workdir</span> <span class="token punctuation">=</span> <span class="token string">"src"</span> <span class="token property">output</span> <span class="token punctuation">=</span> <span class="token string">"results-gpu"</span> <span class="token punctuation">}</span> <span class="token property">environment</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span> <span class="token property">TF_CPP_MIN_LOG_LEVEL</span> <span class="token punctuation">=</span> <span class="token string">"1"</span> <span class="token punctuation">}</span> <span class="token property">script</span> <span class="token punctuation">=</span> <span class="token heredoc string"><<-END #!/bin/bash sudo apt-get update -q sudo apt-get install -yq python3-pip pip3 install -r requirements.txt tensorflow==2.8.0 python3 train.py --output results-gpu/metrics.json END</span> <span class="token punctuation">}</span></code></pre></div> <h2 id="ready-set-apply" style="position:relative;">Ready… Set… Apply!<a href="#ready-set-apply" aria-label="ready set apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Whether you want to go with the basic example, or the GPU-enabled training, you can run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token 
input">$ </span><span class="token command">terraform</span> apply</span></code></pre></div> <p>to review what steps Terraform is going to take to provision your desired infrastructure. Don't worry, nothing is actually done at this point, but it's a good way to check for potential issues in the configuration. You'll need to type <code>yes</code> to confirm.</p> <p>At this point you can go offline, get a cup of your preferred beverage, and let TPI work its magic together with Terraform. They will allocate a remote machine for you, upload your data and script, and run your code. Once the script finishes, the machine will be terminated.</p> <p>You can monitor what's going on at any point by running:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> refresh </span><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> show</span></code></pre></div> <p>This will print the logs and the script's output. Once you see that the task has successfully finished, run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">terraform</span> destroy</span></code></pre></div> <p>to sync back your shared files and tear down all remote objects managed by your configuration. If you output results (e.g. <code>results-gpu/metrics.json</code>), they'll be synced back to your local machine.</p> <p>Now if you want to try another experiment, you can change your code, run <code>terraform apply</code> again, and when the training is finished, commit your code together with the updated results. This can help you move from prototyping locally to leveraging more powerful cloud instances without the hassle of a full MLOps setup. 
At the same time, once you're ready to start working on your <a href="https://dvc.org/doc/use-cases/ci-cd-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">production pipelines and CI/CD</a>, this <code>main.tf</code> codification should also make the transition smoother.</p> <p>In this tutorial we covered the simplest example with no GPU, and one with GPUs. In many cases, deploying your pipelines would be easier with your own Docker image (both for prototyping and for production) and CI/CD workflows. If you'd like to learn how to create your own Docker images and use them with TPI, see <a href="https://dvc.org/blog/local-experiments-to-cloud-with-tpi-docker">part 2</a> of this blog post!</p>https://dvc.org/blog/end-to-end-computer-vision-api-part-3-remote-exp-ci-cdhttps://dvc.org/blog/end-to-end-computer-vision-api-part-3-remote-exp-ci-cdMon, 09 May 2022 00:00:00 GMT<h3 id="leveraging-cloud-resources-with-cicd-and-cml" style="position:relative;">Leveraging Cloud Resources with CI/CD and CML<a href="#leveraging-cloud-resources-with-cicd-and-cml" aria-label="leveraging cloud resources with cicd and cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you use the <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML library</a> in combination with CI/CD tools like GitHub Actions or GitLab CI/CD, you can quickly and easily:</p> <ol> <li>provision a powerful virtual machine (VM) in the cloud as training Computer Vision (CV) models often requires powerful GPUs rarely available on local machines</li> 
<li>submit your ML training job to it</li> <li>save the results (metrics, models and other training artifacts)</li> <li>automatically shut down the VM without having to worry about excessive cloud bills</li> </ol> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 460px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/300c88b3b1b5f65753629d661cc916e5/39600/cicd4ml.png" alt="Continuous Integration and Deployment for Machine Learning" title="Continuous Integration and Deployment for Machine Learning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Continuous Integration and Deployment for Machine Learning</em></p> <p>We've configured three <a href="https://github.com/iterative/magnetic-tiles-defect/tree/main/.github/workflows" target="_blank" rel="nofollow noopener noreferrer">workflow files</a> for GitHub Actions, each of which corresponds to a particular stage of the project's lifecycle:</p> <h4 id="1-workflow-for-experimentation-and-hyperparameter-tuning" style="position:relative;">1. 
<a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.github/workflows/1-experiment.yaml" target="_blank" rel="nofollow noopener noreferrer">Workflow for experimentation and hyperparameter tuning</a><a href="#1-workflow-for-experimentation-and-hyperparameter-tuning" aria-label="1 workflow for experimentation and hyperparameter tuning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 400px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/361303930f78e3aafca3884430da2e6d/39600/workflow_exp.png" alt="Workflow for experimentation and hyperparameter tuning" title="Workflow for experimentation and hyperparameter tuning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Workflow for experimentation and hyperparameter tuning</em></p> <p>In this stage, we'll create an experiment branch so that we can experiment with data preprocessing, change the model architecture, tune hyperparameters, etc. Once we think our experiment is ready to be run, we'll push our changes to a remote repository (in this case, GitHub). 
This push will trigger a CI/CD job in GitHub Actions, which in turn will:</p> <p>a) provision an EC2 virtual machine with a GPU in AWS:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Deploy runner on AWS EC2 <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> cml runner \ --cloud=aws \ --cloud-region=us-east-1 \ --cloud-type=g4dn.xlarge \ --labels=cml-runner</span></code></pre></div> <p>b) deploy our experiment branch to a Docker container on this machine:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train-model</span><span class="token punctuation">:</span> <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span> <span class="token key atrule">container</span><span class="token punctuation">:</span> <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span 
class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1 <span class="token key atrule">options</span><span class="token punctuation">:</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>gpus all <span class="token key atrule">environment</span><span class="token punctuation">:</span> cloud <span class="token key atrule">permissions</span><span class="token punctuation">:</span> <span class="token key atrule">contents</span><span class="token punctuation">:</span> read <span class="token key atrule">id-token</span><span class="token punctuation">:</span> write <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2</code></pre></div> <p>c) rerun the entire DVC pipeline and push metrics back to GitHub:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> dvc<span class="token punctuation">-</span>repro<span class="token punctuation">-</span>cml <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> # Install dependencies pipenv install --skip-lock pipenv run dvc pull pipenv run dvc exp run pipenv run dvc push</span></code></pre></div> <p>d) open a pull request and post a report to it that contains a 
table with metrics and model outputs on a few test images:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token comment"># Open a pull request</span> cml <span class="token function">pr</span> dvc.lock metrics.json training_metrics.json training_metrics_dvc_plots/** <span class="token comment"># Create CML report</span> <span class="token builtin class-name">echo</span> <span class="token string">"## Metrics"</span> <span class="token operator">></span> report.md pipenv run dvc metrics show <span class="token parameter variable">--md</span> <span class="token operator">>></span> report.md <span class="token builtin class-name">echo</span> <span class="token string">"## A few random test images"</span> <span class="token operator">>></span> report.md <span class="token keyword">for</span> <span class="token for-or-select variable">file</span> <span class="token keyword">in</span> <span class="token variable"><span class="token variable">$(</span><span class="token function">ls</span> data/test_preds/ <span class="token operator">|</span> <span class="token function">sort</span> <span class="token parameter variable">-R</span> <span class="token operator">|</span> <span class="token function">tail</span> <span class="token parameter variable">-20</span><span class="token variable">)</span></span><span class="token punctuation">;</span> <span class="token keyword">do</span> cml publish data/test_preds/<span class="token variable">$file</span> <span class="token parameter variable">--md</span> <span class="token operator">>></span> report.md <span class="token keyword">done</span> cml send-comment <span class="token parameter variable">--pr</span> <span class="token parameter variable">--update</span> report.md</code></pre></div> <p>The report structure is fully customizable. Below is an example of what the PR and the CML report would look like in this case. 
The test images show (from left to right) input images, ground truth masks and prediction masks.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1df47be6188799d34f3c0cc7678b0be4/39600/pr_cml_report.png" alt="PR and CML report" title="PR and CML report" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>PR and CML report</em></p> <p>At this point, we can assess the results in Iterative Studio and GitHub and decide whether we want to accept the PR or keep experimenting.</p> <h4 id="2-workflow-for-deploying-to-the-development-environment" style="position:relative;">2. <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.github/workflows/2-develop.yaml" target="_blank" rel="nofollow noopener noreferrer">Workflow for deploying to the development environment</a><a href="#2-workflow-for-deploying-to-the-development-environment" aria-label="2 workflow for deploying to the development environment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 400px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/caff74247266358cdd3fb30f73a47aac/39600/workflow_dev.png" alt="Workflow for deploying to the development environment" title="Workflow for deploying to the development environment" loading="auto" 
decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Workflow for deploying to the development environment</em></p> <p>Once we are happy with our model's performance on the experiment branch, we can merge it into the development branch. This triggers a different CI/CD job that will:</p> <p>a) retrain the model if the <code>dev</code> branch contains changes not present in the <code>exp</code> branch. DVC will skip this stage if that's not the case. This step is almost identical to step (1.c) in the workflow above (rerunning the pipeline and reporting metrics on GitHub).</p> <p>b) deploy the web REST API application (that incorporates the new model) to a development endpoint on Heroku:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">deploy-dev-api</span><span class="token punctuation">:</span> <span class="token key atrule">needs</span><span class="token punctuation">:</span> train<span class="token punctuation">-</span>and<span class="token punctuation">-</span>push <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/download<span class="token punctuation">-</span>artifact@master <span class="token key atrule">with</span><span class="token punctuation">:</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> model_pickle <span class="token key atrule">path</span><span class="token punctuation">:</span> models <span class="token punctuation">-</span> <span class="token key 
atrule">uses</span><span class="token punctuation">:</span> akhileshns/heroku<span class="token punctuation">-</span>[email protected] <span class="token key atrule">with</span><span class="token punctuation">:</span> <span class="token key atrule">heroku_api_key</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span>secrets.HEROKU_API_KEY<span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">heroku_app_name</span><span class="token punctuation">:</span> demo<span class="token punctuation">-</span>api<span class="token punctuation">-</span>mag<span class="token punctuation">-</span>tiles<span class="token punctuation">-</span>dev <span class="token key atrule">heroku_email</span><span class="token punctuation">:</span> <span class="token string">'[email protected]'</span> <span class="token key atrule">team</span><span class="token punctuation">:</span> iterative<span class="token punctuation">-</span>sandbox <span class="token key atrule">usedocker</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div> <p>The development endpoint is now accessible at</p> <p><a href="https://demo-api-mag-tiles-dev.herokuapp.com/analyze" target="_blank" rel="nofollow noopener noreferrer">https://demo-api-mag-tiles-dev.herokuapp.com/analyze</a> (note <code>-dev</code>),</p> <p>and we can use it to assess the end-to-end performance of the overall solution. 
If we pick a random test image <code>exp3_num_258558.jpg</code>, <span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 252px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5be68c36cd3947c4aa3c4c042639eece/88c24/exp3_num_258558.jpg" alt="Test image exp3_num_258558.jpg" title="Test image exp3_num_258558.jpg" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Test image <code>exp3_num_258558.jpg</code></em></p> <p>we can send it to the endpoint using the <code>curl</code> command like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">curl</span> <span class="token parameter variable">-F</span> <span class="token string">'image=@data/MAGNETIC_TILE_SURFACE_DEFECTS/test_images/exp3_num_258558.jpg'</span> <span class="token punctuation">\</span> <span class="token parameter variable">-v</span> https://demo-api-mag-tiles-dev.herokuapp.com/analyze</span></code></pre></div> <p>This will print some HTTP header information followed by the response body, which contains the defect segmentation mask (<code>0</code> for pixel locations without defects and <code>1</code> otherwise):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">* Trying 18.208.60.216:443... * Connected to demo-api-mag-tiles-dev.herokuapp.com (18.208.60.216) port 443 (#0) ... 
{"pred":[[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,</code></pre></div> <p>Alternatively, we can do a similar thing with a Python script that also saves the output mask into a <code>exp3_num_258558_mask.png</code> image:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> json <span class="token keyword">from</span> pathlib <span class="token keyword">import</span> Path <span class="token keyword">import</span> matplotlib<span class="token punctuation">.</span>cm <span class="token keyword">as</span> cm <span class="token keyword">import</span> matplotlib<span class="token punctuation">.</span>pyplot <span class="token keyword">as</span> plt <span class="token keyword">import</span> numpy <span class="token keyword">as</span> np <span class="token keyword">import</span> requests url <span class="token operator">=</span> <span class="token string">'https://demo-api-mag-tiles-dev.herokuapp.com/analyze'</span> file_path <span class="token operator">=</span> Path<span class="token punctuation">(</span> <span class="token string">'data/MAGNETIC_TILE_SURFACE_DEFECTS/test_images/exp3_num_258558.jpg'</span><span class="token punctuation">)</span> files <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">'image'</span><span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token builtin">str</span><span class="token punctuation">(</span>file_path<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token builtin">open</span><span class="token 
punctuation">(</span>file_path<span class="token punctuation">,</span> <span class="token string">'rb'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"image/jpeg"</span><span class="token punctuation">)</span><span class="token punctuation">}</span> response <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>url<span class="token punctuation">,</span> files<span class="token operator">=</span>files<span class="token punctuation">)</span> data <span class="token operator">=</span> json<span class="token punctuation">.</span>loads<span class="token punctuation">(</span>response<span class="token punctuation">.</span>content<span class="token punctuation">)</span> pred <span class="token operator">=</span> np<span class="token punctuation">.</span>array<span class="token punctuation">(</span>data<span class="token punctuation">[</span><span class="token string">'pred'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> plt<span class="token punctuation">.</span>imsave<span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f'</span><span class="token interpolation"><span class="token punctuation">{</span>file_path<span class="token punctuation">.</span>stem<span class="token punctuation">}</span></span><span class="token string">_mask.png'</span></span><span class="token punctuation">,</span> pred<span class="token punctuation">,</span> cmap<span class="token operator">=</span>cm<span class="token punctuation">.</span>gray<span class="token punctuation">)</span></code></pre></div> <p>Below you can see what this mask looks like. 
<span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 252px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5df849e0bdc1e259cd2f1286c1636d68/019e0/exp3_num_258558_mask.png" alt="Output mask exp3_num_258558_mask.png" title="Output mask exp3_num_258558_mask.png" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Output mask <code>exp3_num_258558_mask.png</code></em></p> <p>Before we merge the dev branch into the main branch, we need to thoroughly test and monitor the application in the development environment. A good test could be to duplicate real image requests to the dev endpoint for some period of time and assess the quality of the returned segmentation masks.</p> <h4 id="3-workflow-for-deploying-to-the-production-environment" style="position:relative;">3. <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.github/workflows/3-deploy.yaml" target="_blank" rel="nofollow noopener noreferrer">Workflow for deploying to the production environment</a><a href="#3-workflow-for-deploying-to-the-production-environment" aria-label="3 workflow for deploying to the production environment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h4> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 400px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/37c9afef843976da2bd2d8d6cf1c744a/39600/workflow_prod.png" alt="Workflow 
for deploying to the production environment" title="Workflow for deploying to the production environment" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Workflow for deploying to the production environment</em></p> <p>If there are no issues and we are confident in the quality of the new model, we can merge the development branch into the main branch of our repository. Again, this triggers the third CI/CD workflow that deploys the code from the main branch to the production API. This looks identical to the deployment into the development environment, except now the deployment endpoint will be</p> <p><a href="https://demo-api-mag-tiles-prod.herokuapp.com/analyze" target="_blank" rel="nofollow noopener noreferrer">https://demo-api-mag-tiles-prod.herokuapp.com/analyze</a> (note <code>-prod</code>).</p> <h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In this series of posts (see <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines">Part 1</a> and <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments">Part 2</a>), we described how we addressed the problem of building a Computer Vision Web API for defect detection. 
We’ve chosen this approach because it addresses challenges that are common to many CV projects: how to version datasets that consist of a large number of small- to medium-sized files; how to avoid triggering long-running stages of an ML pipeline when rerunning them is not needed for reproducibility; how to run model training jobs on cloud infrastructure without having to provision and manage everything yourself; and, finally, how to track progress in key metrics when you run many ML experiments.</p> <p>We've talked about the following:</p> <ul> <li>Common difficulties when building a Computer Vision Web API for defect detection (<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines#introduction">link</a>)</li> <li>Pros and cons of exploratory work in Jupyter Notebooks (<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines#proof-of-concept-in-jupyter-notebooks">link</a>)</li> <li>Versioning data in remote storage with DVC (<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines#data-versioning">link</a>)</li> <li>Moving and refactoring the code from Jupyter Notebooks into DVC pipeline stages (<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines#refactoring-jupyter-code-into-an-ml-pipeline">link</a>)</li> <li>Experiment management and versioning (<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments#experiment-management">link</a>)</li> <li>Visualization of experiments and collaboration in Iterative Studio (<a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments#collaboration-and-reporting-with-iterative-studio">link</a>)</li> <li>Remote experiments, CI/CD, and production deployment (this post)</li> </ul> <h2 id="what-to-try-next" style="position:relative;">What to Try Next<a href="#what-to-try-next" aria-label="what
to try next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Missed the previous parts of this post? See <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines">Part 1: Data Versioning and ML Pipelines</a> and <a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experiments">Part 2: Local Experiments</a>.</p> <ul> <li>Reproduce this solution by setting your own configs, tokens, and access keys for GitHub, AWS, and Heroku</li> <li>Add a check to merge PRs automatically if the metrics have improved</li> <li>Add a few simple unit tests and insert them into CML workflow files so they run before reproducing the pipeline</li> <li>Apply this approach to a different Computer Vision problem using a different dataset or different problem type (image classification, object detection, etc.)</li> </ul>https://dvc.org/blog/CML-runners-saving-models-2https://dvc.org/blog/CML-runners-saving-models-2Fri, 06 May 2022 00:00:00 GMT<p>In <a href="https://dvc.org/blog/CML-runners-saving-models-1" target="_blank" rel="nofollow noopener noreferrer">part 1 of this guide</a> we showed how you can use CML to provision an AWS EC2 instance to train your model before saving the model to our Git repository. In doing so, we allowed ourselves to terminate the training instance without losing our model altogether.</p> <p>This worked perfectly fine for the simple model we trained, but this approach is not optimal when dealing with larger models. 
GitHub starts warning you about files larger than 50MB and simply <a href="https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github" target="_blank" rel="nofollow noopener noreferrer">won't upload anything over 100MB</a>. <a href="https://docs.gitlab.com/ee/user/gitlab_com/index.html#account-and-limit-settings" target="_blank" rel="nofollow noopener noreferrer">GitLab similarly limits</a> the size of files you can store in your repository. A beefy XGBoost model can easily exceed 100MB and a neural network can go up into the gigabytes.</p> <p>That means we cannot save these models directly to our repository. Luckily, we can turn to another one of Iterative's open-source tools: <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>. DVC includes a lot of features for managing machine learning projects, such as ML pipelines, experiment tracking, and data versioning. In this guide we will zoom in on just one of those features: remote storage.</p> <p>We can use DVC to save our model to a remote storage location, such as Amazon S3, HDFS, an SFTP server, or even Google Drive. Much like Git tracks changes to your code, DVC tracks changes to your data. It puts a reference to a specific version of your data in the Git commit. That way your code is linked to a specific version of your model, without containing the actual model.</p> <p>In this part 2, we will show you how to save the model we trained in part 1 to a DVC remote.
At the end of this guide our CML workflow will do the following on a daily basis:</p> <ol> <li>Provision an Amazon Web Services (AWS) EC2 instance</li> <li>Train the model</li> <li>Save the model to a DVC remote on Google Drive</li> <li>Save the model metrics to a GitHub repository</li> <li>Create a pull request with the new outputs</li> <li>Terminate the AWS EC2 instance</li> </ol> <p>All files needed for this guide can be found in <a href="https://github.com/iterative/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">this repository</a>.</p> <admon type="tip"> <p>We will be using Google Drive as our remote storage. With slight modifications, however, you can also use other remotes such as AWS S3, GCP Cloud Storage, and Azure Storage. Please <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">refer to the DVC Docs</a> for more details.</p> </admon> <h1 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Make sure you have followed <a href="https://dvc.org/blog/CML-runners-saving-models-1" target="_blank" rel="nofollow noopener noreferrer">part 1 of this guide</a> and have CML up and running. The necessary files for all of this can be found in <a href="https://github.com/iterative/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">this repository</a>.
Additionally, set up the following things beforehand:</p> <ul> <li><a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">Install DVC</a></li> <li><a href="https://dvc.org/doc/user-guide/setup-google-drive-remote#using-a-custom-google-cloud-project-recommended" target="_blank" rel="nofollow noopener noreferrer">Set up a GCP project</a></li> <li><a href="https://console.cloud.google.com/apis/library/drive.googleapis.com" target="_blank" rel="nofollow noopener noreferrer">Enable the Google Drive API for your GCP project</a></li> <li><a href="https://dvc.org/doc/user-guide/setup-google-drive-remote#using-service-accounts" target="_blank" rel="nofollow noopener noreferrer">Create a GCP service account and download the private key to a safe location</a></li> <li><a href="https://support.google.com/drive/answer/2375091?hl=en&co=GENIE.Platform%3DDesktop" target="_blank" rel="nofollow noopener noreferrer">Create a Google Drive directory to save your model to</a></li> <li><a href="https://support.google.com/drive/answer/7166529?hl=en&co=GENIE.Platform%3DDesktop" target="_blank" rel="nofollow noopener noreferrer">Grant the service account editor permissions to the Drive directory by sharing it with the service account's email address</a></li> </ul> <h1 id="setting-up-our-dvc-remote" style="position:relative;">Setting up our DVC remote<a href="#setting-up-our-dvc-remote" aria-label="setting up our dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>When first using DVC in a project, you need to initialize DVC by running 
<a href="https://dvc.org/doc/command-reference/init"><code>dvc init</code></a>. This will create the structure DVC uses to keep track of versioning, and ensures Git will not be tracking the files in the DVC repository. Instead, Git will henceforth include a list of references to those files. Make sure to commit the initialization to Git.</p> <p>Then, in order to start using DVC for versioning, we need to set up a remote. This is where our model files will end up, while DVC keeps track of their respective versions. Here we will be using Google Drive as our remote.</p> <p><a href="https://dvc.org/doc/user-guide/setup-google-drive-remote#setup-a-google-drive-dvc-remote" target="_blank" rel="nofollow noopener noreferrer">The DVC user guide</a> explains how to set up a remote on Google Drive. If you would rather use another remote, you can <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">find instructions here</a>. In that case make sure to also update the DVC dependency in <code>requirements.txt</code>.</p> <p>While DVC doesn't require a service account to work, we do need one in the set-up we're aiming for. That's because without a service account we will need to authorize ourselves through a log-in page every time. Our self-hosted runner would get stuck on this page because we cannot authorize ourselves there.</p> <p>In order to let DVC access the Google Drive folder we created from our runner, we need to add two more GitHub Actions secrets: <code>GDRIVE_CREDENTIALS_DATA</code> and <code>GOOGLE_DRIVE_URI</code>. The first one should contain the private key we downloaded when setting up our service account (i.e. the <code>.json</code> file). The second one should be the <a href="https://cloud.google.com/bigquery/external-data-drive" target="_blank" rel="nofollow noopener noreferrer">Drive URI</a> to the directory we created in Google Drive (i.e. 
the sequence of random characters at the end of our Google Drive URL).</p> <h1 id="export-the-model-to-a-dvc-remote" style="position:relative;">Export the model to a DVC remote<a href="#export-the-model-to-a-dvc-remote" aria-label="export the model to a dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Now that we have set up the remote and made sure GitHub Actions has all the details needed to access the remote, we can use the workflow below. In this scenario, we train the model in the same way as in part 1, but we push it to the DVC remote. A reference to the location of this file is added to the GitHub repository (<code>model/random_forest.joblib.dvc</code>). The model itself is added to <code>.gitignore</code> and not pushed to the repository.</p> <p>The other files created in <code>train.py</code> are still pushed to an experiment branch in GitHub. 
Afterwards a merge request is created.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> CML with DVC <span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token comment"># Here we use two triggers: manually and daily at 08:00</span> <span class="token key atrule">workflow_dispatch</span><span class="token punctuation">:</span> <span class="token key atrule">schedule</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">cron</span><span class="token punctuation">:</span> <span class="token string">'0 8 * * *'</span> <span class="token key atrule">jobs</span><span class="token punctuation">:</span> <span class="token key atrule">deploy-runner</span><span class="token punctuation">:</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1 <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Deploy runner on EC2 <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token 
punctuation">}</span> <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> cml runner \ --cloud=aws \ --cloud-region=eu-west \ --cloud-type=t2.micro \ --labels=cml-runner \ --single</span> <span class="token key atrule">train-model</span><span class="token punctuation">:</span> <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span> <span class="token key atrule">timeout-minutes</span><span class="token punctuation">:</span> <span class="token number">120</span> <span class="token comment"># 2h</span> <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2 <span 
class="token key atrule">with</span><span class="token punctuation">:</span> <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>node@v3 <span class="token key atrule">with</span><span class="token punctuation">:</span> <span class="token key atrule">node-version</span><span class="token punctuation">:</span> <span class="token string">'16'</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1 <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Train model <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">GDRIVE_CREDENTIALS_DATA</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GDRIVE_CREDENTIALS_DATA <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> cml ci pip install -r requirements.txt</span> python get_data.py python train.py <span class="token comment"># Connect to your DVC remote storage and push the model to there</span> dvc add model/random_forest.joblib <span class="token comment"># This 
automatically adds the model to your .gitignore</span> dvc remote add <span class="token punctuation">-</span>d <span class="token punctuation">-</span>f myremote gdrive<span class="token punctuation">:</span>//$<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GOOGLE_DRIVE_URI <span class="token punctuation">}</span><span class="token punctuation">}</span> dvc remote modify myremote gdrive_use_service_account true dvc push <span class="token comment"># Create pull request for the remaining files</span> cml pr . <span class="token comment"># Create CML report</span> cat model/metrics.txt <span class="token punctuation">></span> report.md cml publish model/confusion_matrix.png <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md cml send<span class="token punctuation">-</span>comment <span class="token punctuation">-</span><span class="token punctuation">-</span>pr <span class="token punctuation">-</span><span class="token punctuation">-</span>update report.md</code></pre></div> <p>And that's it! We have broadly the same set-up as outlined in part 1 of this guide, but we no longer use our GitHub repository for storing our model. Instead, the model is now saved to Google Drive, which allows for much larger models.</p> <admon type="tip"> <p>In a situation where we retrain our model daily based on the most recent data, it would make sense to also use DVC to keep track of the data used in each training. We could, for example, use our runner to import our training data from a table in our database and write both the data and the model to the DVC remote. 
This is beyond the scope of this guide, but <a href="https://github.com/iterative/cml_dvc_case" target="_blank" rel="nofollow noopener noreferrer">here you can find a repository that covers this</a>.</p> </admon> <admon type="tip"> <p>While we have achieved our goal of using DVC for our model storage, there are some other DVC features we could benefit from as well. We could define a reproducible pipeline to replace our manual <code>get_data.py</code> and <code>train.py</code> tasks. <a href="https://dvc.org/doc/start/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">Here you can find</a> more information on how to achieve this.</p> </admon> <h1 id="conclusions" style="position:relative;">Conclusions<a href="#conclusions" aria-label="conclusions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>As we saw in <a href="https://dvc.org/blog/CML-runners-saving-models-1">part 1 of this guide</a>, we can use CML to automate periodic retraining of our models on a self-hosted runner. We were able to save the model to our GitHub repository, but that approach has its limitations with regard to model size.</p> <p>In this part 2 we worked around those limitations by saving our model to a DVC remote instead. We set up Google Drive as our remote and adapted our CML workflow to save our models there.
All in all, we can now automatically (re)train models using a self-hosted runner, track different model versions in Git, and save models to a remote storage such as Google Drive for future reference.</p> <p>A great extension of our CI/CD would be a <code>deploy</code> step to bring the latest version of our model into production. This step might be conditional on the performance of the model; we could decide to only start using it in production if it performs better than previous iterations. All of this warrants a guide of its own, however, so look out for that in the future! 😉</p>https://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experimentshttps://dvc.org/blog/end-to-end-computer-vision-api-part-2-local-experimentsThu, 05 May 2022 00:00:00 GMT<h3 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelines" target="_blank" rel="nofollow noopener noreferrer">Earlier</a>, we built a pipeline that produces a trained Computer Vision model. Now we need a way to efficiently tune its configuration and the hyperparameters of the model. 
We want the ability to:</p> <ul> <li>Run many experiments and easily compare their results to pick the best-performing ones.</li> <li>Track the global history of the model's performance, and map each improvement to a particular change in code, configuration, or data.</li> <li>Zoom into the details of each training run to help us diagnose issues.</li> </ul> <h3 id="experiment-management" style="position:relative;">Experiment Management<a href="#experiment-management" aria-label="experiment management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our DVC pipeline relies on the parameters defined in the <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/params.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>params.yaml</code></a> file (see other supported file types <a href="https://dvc.org/doc/command-reference/params#description" target="_blank" rel="nofollow noopener noreferrer">here</a>). By loading its contents in each stage, we can avoid hard-coded parameters. It also allows rerunning all or part of our pipeline with a different set of parameters.
The DVC pipeline YAML file <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/dvc.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>dvc.yaml</code></a> supports a <a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating" target="_blank" rel="nofollow noopener noreferrer">templating format</a> to insert values from different sources in the YAML structure itself.</p> <p>DVC tracks which stages of the pipeline experienced changes and only reruns those. By changes, we mean <em>everything</em> that might affect the predictive performance of your model like changes to the dataset, source code and/or parameters. This not only ensures complete reproducibility but often significantly reduces the time needed to rerun the whole pipeline while ensuring consistent results on every rerun. For example, at first, we started with a pixel accuracy metric (the percent of pixels in your image that are classified correctly). Later, we realized that it might not be the best metric to track (as described in <a href="https://towardsdatascience.com/metrics-to-evaluate-your-semantic-segmentation-model-6bcb99639aa2" target="_blank" rel="nofollow noopener noreferrer">this blog post</a>), and we decided to include the Dice coefficient into our metrics. There is no reason for us to rerun the often time-consuming data preprocessing and model training stages if we want to incorporate these updates. 
DVC pipelines can skip the execution of these stages without our explicit instructions:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> </span>Running stage 'check_packages': > pipenv run pip freeze > requirements.txt Stage 'data_load' didn't change, skipping Stage 'data_split' didn't change, skipping Stage 'train' didn't change, skipping Running stage 'evaluate': > python src/stages/eval.py --config=params.yaml ...</code></pre></div> <p>DVC also provides a convenient set of <a href="https://dvc.org/doc/user-guide/experiment-management" target="_blank" rel="nofollow noopener noreferrer">Experiment Management</a> features that make it easy to switch between reproducible experiments without adding failed experiments to your Git history. Check out this <a href="https://dvc.org/blog/ml-experiment-versioning" target="_blank" rel="nofollow noopener noreferrer">blog post</a>, which talks about the idea of "ML Experiments as Code." That means treating experiments as you'd treat code, that is, using Git to track all changes in configs, metrics, and data versions through text files. This approach removes the need for a separate database or online service to store experiment metadata. If we wanted to run a few experiments with learning rates of different orders of magnitude (e.g. <code>0.1</code>, <code>0.01</code> and <code>0.001</code>), we'd do that as follows:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">train.learning_rate</span><span class="token operator">=</span><span class="token number">0.1</span> </span>...
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">train.learning_rate</span><span class="token operator">=</span><span class="token number">0.01</span> </span>... <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">train.learning_rate</span><span class="token operator">=</span><span class="token number">0.001</span> </span>...</code></pre></div> <p>Optionally, you can delay the execution of the experiments by putting them in a <a href="https://dvc.org/doc/user-guide/experiment-management/running-experiments#the-experiments-queue" target="_blank" rel="nofollow noopener noreferrer">queue</a>, and execute them later with the <a href="https://dvc.org/doc/command-reference/exp/run#--run-all"><code>dvc exp run --run-all</code></a> command.</p> <p>These local experiments are powered by Git references, and you can learn about them in <a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">this post</a>. 
We can display all experiments with the <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a> command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--only-changed</span> <span class="token parameter variable">--sort-by</span><span class="token operator">=</span>dice_mean</span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable">──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span>                <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span>        <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>train.loss<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>valid.loss<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>foreground.acc<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>jaccard.coeff<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>dice.multi<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>dice_mean<span class="token hide">**</span></span></span>   <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc_mean<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.learning_rate<span class="token hide">**</span></span></span>   <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.batch_size<span class="token hide">**</span></span></span>   <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>models<span class="token hide">**</span></span></span>
</span>──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> workspace                 -              0.10356      0.069076     0.90321          0.75906         0.92371      0.70612     0.97689    0.01                  16                 5854528
 exp                       Apr 09, 2022   0.13305      0.087599     0.77803          0.66494         0.89084      0.70534     0.97891    0.01                  8                  6c513ae
 ├── 83a4975 [exp-2d80e]   Apr 09, 2022   0.11189      0.088695     0.86905          0.75296         0.92005      0.70612     0.97689    0.01                  16                 5854528
 ├── 675efb3 [exp-6c274]   Apr 09, 2022   0.10356      0.069076     0.90321          0.75906         0.92371      0.71492     0.98099    0.1                   16                 770745a
 └── c8b1857 [exp-04bcd]   Apr 09, 2022   0.11189      0.088695     0.86905          0.75296         0.92005      0.71619     0.98025    0.01                  8                  094c420
</span>──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>Once we identify the best experiment or experiments (e.g., those with the highest <code>dice_mean</code> score), we can <a href="https://dvc.org/doc/user-guide/experiment-management/persisting-experiments" target="_blank" rel="nofollow noopener noreferrer">persist</a> them by creating a branch from each one:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp branch</span> exp-04bcd my-branch
</span>Git branch 'my-branch' has been created from experiment 'exp-04bcd'.
To switch to the new branch run:

        git checkout my-branch</code></pre></div> <p>To track detailed information about the training process, we integrated <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a> into the training code by <a href="https://github.com/iterative/magnetic-tiles-defect/blob/41a057cf9b9a4a738087c8ad046b99c21f4faf17/src/utils/train_utils.py#L45" target="_blank" rel="nofollow noopener noreferrer">adding a callback object</a> to the training function. 
DVCLive is a Python library for logging machine learning metrics and other metadata in simple file formats, which is fully compatible with DVC.</p> <h2 id="collaboration-and-reporting-with-iterative-studio" style="position:relative;">Collaboration and Reporting with Iterative Studio<a href="#collaboration-and-reporting-with-iterative-studio" aria-label="collaboration and reporting with iterative studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>What if we needed to report the results to our team members or maybe hand over the project to one of them? How do we communicate everything we did since the conception of the project? What things resulted in the most significant improvements? What things didn't seem to matter at all?</p> <p><a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a> is a web-based application with seamless integration with DVC for data and model management, experiment tracking, visualization, and automation. It becomes especially valuable when collaborating with others on the same project or when there's a need to summarize the progress of the project through metrics and plots. All that's needed is to connect the project's repository with Studio. Then Studio will automatically parse all required information from <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, <code>params.yaml</code>, and other text files that DVC recognizes. The result will be a repository view. 
The view for our project is <a href="https://studio.datachain.ai/user/alex000kim/views/magnetic-tiles-defect-5kozhnu9jo" target="_blank" rel="nofollow noopener noreferrer">here</a>. It displays commits, metrics, parameters, the remote location of data and models tracked by DVC, and more.</p> <p>In the screenshot below, you can see that we created a separate <code>exp</code> branch that displays the results of the local experiments that we decided to upload to our remote repository, like trying different learning rates and batch sizes. Note that earlier, we discarded all local experiments whose performance we weren't satisfied with.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/113731a6cdf78d5f57dc0d416fcffd28/39600/studio_view.png" alt="Studio view" title="Studio view" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Below we can see the evolution of the key metrics and the value of the loss function throughout training (enabled by the earlier integration of <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>) for a set of selected commits.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9ca3705395f114f2c0d24507db503398/39600/dvc_live_studio.png" alt="DVCLive metrics displayed in Studio" title="DVCLive metrics displayed in Studio" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Now, for example, if we see that the loss function hasn't reached a plateau after a certain number of epochs, we'll try increasing this number. 
Or, even worse, if we see the loss function growing over time, it'll be an indication that our learning rate may be too high. In this case, we may generate a few additional experiments with lower learning rate values, eventually picking the one that achieves good model performance after a reasonable number of training epochs.</p> <h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In this post, we talked about the following:</p> <ul> <li>How to run and view ML experiments locally and commit the most promising ones to the remote git repository</li> <li>How the integration of Iterative Studio with DVC enables collaboration, traceability, and reporting on projects with multiple team members</li> <li>How DVCLive allows us to peek into the training process and helps us decide what ideas to try next</li> </ul> <p>What if we don't have a machine with a powerful GPU, and we'd like to take advantage of our cloud infrastructure? What if we'd like to have a custom report (with metrics, plots, and other visuals) accompany every commit/pull request on GitHub? 
The third (and last) part of this series of posts will demonstrate how another open-source tool from the Iterative ecosystem, <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>, addresses these issues.</p>https://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelineshttps://dvc.org/blog/end-to-end-computer-vision-api-part-1-data-versioning-and-ml-pipelinesTue, 03 May 2022 00:00:00 GMT<h3 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In this series of posts, we'll describe an approach that streamlines the lifecycle stages of a typical Computer Vision project, from proof of concept through configuration and parameter tuning to, finally, deployment to the production environment.</p> <p>Automatic defect detection is a common problem encountered in many industries, especially manufacturing. A typical setup would include a conveyor belt that moves some products along the production line and a camera installed above the conveyor. The camera takes pictures of the products moving below and connects to a computer that controls it. 
This computer needs to send raw images to some defect detection service, receive information about the location and size of the defects, if any, and may even control what happens to a defective product by being connected to a robotic arm via a PLC (programmable logic controller).</p> <p>As our demo project, we've selected a very common deployment pattern for this setup: a CV model wrapped in a web API service. Specifically, we'll perform an <a href="https://ai.stanford.edu/~syyeung/cvweb/tutorial3.html" target="_blank" rel="nofollow noopener noreferrer">image segmentation</a> task on a magnetic tiles dataset first introduced in this <a href="https://www.researchgate.net/profile/Congying-Qiu/publication/327701995_Saliency_defect_detection_of_magnetic_tiles/links/5b9fd1bd45851574f7d25019/Saliency-defect-detection-of-magnetic-tiles.pdf" target="_blank" rel="nofollow noopener noreferrer">paper</a> and available in this GitHub <a href="https://github.com/abin24/Magnetic-tile-defect-datasets." target="_blank" rel="nofollow noopener noreferrer">repository</a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ccf0a3239a685a57ccdab7f42d52f25f/39600/dataset_sample.png" alt="A sample from the image segmentation dataset we used for this project. Top row: images of magnetic tile surfaces. Bottom row: segmentation mask (white pixels show defective areas)" title="A sample from the image segmentation dataset we used for this project. Top row: images of magnetic tile surfaces. 
Bottom row: segmentation mask (white pixels show defective areas)" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <ul> <li>This post (part 1) introduces the concepts of data versioning and ML pipelines as they apply to Computer Vision projects.</li> <li>Part 2 will focus on experiment tracking and management - key components needed for effective collaboration between team members.</li> <li>In part 3, you’ll learn how to easily move your model training workloads from a local machine to cloud infrastructure and set up proper CI/CD workflows for ML projects.</li> </ul> <h3 id="target-audience" style="position:relative;">Target Audience<a href="#target-audience" aria-label="target audience permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We assume the target audience of this post to be technical folks who are familiar with the general Computer Vision concepts, CI/CD processes, and Cloud infrastructure. 
Familiarity with the Iterative ecosystem of tools such as <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>, and <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Studio</a> is not required but would help with understanding the nuances of our solution.</p> <h3 id="summary-of-the-solution" style="position:relative;">Summary of the Solution<a href="#summary-of-the-solution" aria-label="summary of the solution permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>All the code for the project is stored in this GitHub <a href="https://github.com/iterative/magnetic-tiles-defect" target="_blank" rel="nofollow noopener noreferrer">repository</a>.</p> <p>The CV API solution that we are proposing can be summarized in the following steps:</p> <ul> <li>The client service will submit the image to our API endpoint</li> <li>The image will be preprocessed to adhere to the specifications that our model expects</li> <li>The CV model will ingest the processed image and output its prediction image mask</li> <li>Some postprocessing will be applied to the image mask</li> <li>The output mask will be returned to the client</li> </ul> <p>The repository also contains code for the web application itself, which can be found in the <a href="https://github.com/iterative/magnetic-tiles-defect/tree/main/app" target="_blank" rel="nofollow noopener noreferrer"><code>app</code></a> directory. 
While the web application is very simple, its implementation is beyond the scope of this blog post. In short, we can say that it's based on the <a href="https://fastapi.tiangolo.com/" target="_blank" rel="nofollow noopener noreferrer"><code>FastAPI</code></a> library, and we deploy it to the Heroku platform through a Docker container defined in this <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/Dockerfile" target="_blank" rel="nofollow noopener noreferrer"><code>Dockerfile</code></a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 671px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b38f70c06e772513379790ada3051d4b/508aa/web_api_diagram.png" alt="Simplified diagram of the CV API solution" title="Simplified diagram of the CV API solution" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="prerequisites-for-reproduction" style="position:relative;">Prerequisites for Reproduction<a href="#prerequisites-for-reproduction" aria-label="prerequisites for reproduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Feel free to fork the <a href="https://github.com/iterative/magnetic-tiles-defect" target="_blank" rel="nofollow noopener noreferrer">repository</a> if you'd like to replicate our steps and deploy your own API service. 
Keep in mind that you'll need to set up and configure the following:</p> <ul> <li>GitHub account and <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token" target="_blank" rel="nofollow noopener noreferrer">GitHub application token</a></li> <li><a href="https://pipenv.pypa.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer"><code>pipenv</code></a> installed locally</li> <li>AWS account, <a href="https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/" target="_blank" rel="nofollow noopener noreferrer">access keys</a>, and an <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html" target="_blank" rel="nofollow noopener noreferrer">S3 bucket</a></li> <li>Heroku account and <a href="https://help.heroku.com/PBGP6IDE/how-should-i-generate-an-api-key-that-allows-me-to-use-the-heroku-platform-api" target="_blank" rel="nofollow noopener noreferrer">Heroku API key</a></li> </ul> <p>For security reasons, you'll need to set up all keys and tokens through <a href="https://docs.github.com/en/actions/security-guides/encrypted-secrets" target="_blank" rel="nofollow noopener noreferrer">GitHub secrets</a>. 
You'll also need to change the remote location (and its name) in the <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.dvc/config" target="_blank" rel="nofollow noopener noreferrer">DVC config</a> file for versioning data and other artifacts.</p> <h3 id="proof-of-concept-in-jupyter-notebooks" style="position:relative;">Proof-of-Concept in Jupyter Notebooks<a href="#proof-of-concept-in-jupyter-notebooks" aria-label="proof of concept in jupyter notebooks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>A typical ML project would start with data collection and/or labeling, but we are skipping all this hard work because it was done for us by the researchers who published the dataset.</p> <p>We'll get right to the exciting part of training CV models in Jupyter notebooks which you can find <a href="https://github.com/iterative/magnetic-tiles-defect/tree/main/notebooks" target="_blank" rel="nofollow noopener noreferrer">here</a>. 
In short, there are three notebooks:</p> <ol> <li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/notebooks/1_ProcessData.ipynb" target="_blank" rel="nofollow noopener noreferrer"><code>1_ProcessData.ipynb</code></a> downloads, processes, and organizes the data for easy loading into the training process later</li> <li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/notebooks/2_TrainSegmentationModel.ipynb" target="_blank" rel="nofollow noopener noreferrer"><code>2_TrainSegmentationModel.ipynb</code></a> uses the <a href="https://github.com/fastai/fastai" target="_blank" rel="nofollow noopener noreferrer"><code>fastai</code></a> deep learning framework to train an image segmentation model</li> <li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/notebooks/3_Evaluate.ipynb" target="_blank" rel="nofollow noopener noreferrer"><code>3_Evaluate.ipynb</code></a> computes model performance on the test dataset</li> </ol> <p>Jupyter Notebook is by far the most popular tool for quick exploratory work when it comes to data analysis and modeling. However, it's not without <a href="https://www.youtube.com/watch?v=7jiPeIFXb6U" target="_blank" rel="nofollow noopener noreferrer">its own limitations</a>. One of the biggest issues with Jupyter is that it has no guardrails to ensure reproducibility: variables and objects can hold hidden state, and cells can be run out of order. 
While there are several projects that attempt to alleviate some of these issues (notably, <a href="https://github.com/stitchfix/nodebook" target="_blank" rel="nofollow noopener noreferrer"><code>nodebook</code></a>, <a href="https://github.com/nteract/papermill" target="_blank" rel="nofollow noopener noreferrer"><code>papermill</code></a>, <a href="https://github.com/jupyter/nbdime" target="_blank" rel="nofollow noopener noreferrer"><code>nbdime</code></a>, <a href="https://github.com/computationalmodelling/nbval" target="_blank" rel="nofollow noopener noreferrer"><code>nbval</code></a>, <a href="https://github.com/kynan/nbstripout" target="_blank" rel="nofollow noopener noreferrer"><code>nbstripout</code></a>, and <a href="https://github.com/nbQA-dev/nbQA" target="_blank" rel="nofollow noopener noreferrer"><code>nbQA</code></a>), they don’t solve them completely.</p> <p>That's where the concepts of data versioning and ML pipelines come in.</p> <h3 id="data-versioning" style="position:relative;">Data Versioning<a href="#data-versioning" aria-label="data versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In most ML projects, training data changes gradually over time as new training instances (images in our case) get added while older ones might be removed. Simply creating snapshots of our training data at the time of training (e.g. labeling data directories with dates) quickly becomes unsustainable since these snapshots will contain many duplicates. 
Additionally, tracking which data directory was used to train each model quickly becomes hard to manage, and linking data versions and models to their respective code versions complicates things even further.</p> <p>A much better approach is to:</p> <ol> <li> <p>track only the deltas between different versions of the datasets; and</p> </li> <li> <p>have the project’s Git repository store only the reference links to the data while the actual data is stored in remote storage</p> </li> </ol> <p>This is exactly what we can do with <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> by running only a couple of DVC commands. In turn, DVC handles all the underlying complexity of managing data versions, performing file deduplication, pushing and pulling to/from different remote storage solutions, and more.</p> <p>Check out <a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files/tutorial" target="_blank" rel="nofollow noopener noreferrer">this tutorial</a> to learn more about data and model versioning with DVC.</p> <p><img src="https://editor.analyticsvidhya.com../uploads/86351git-dvc.png" alt="Diagram of how DVC performs data versioning "></p> <p>In this project, AWS S3 is our remote storage configured in the <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/.dvc/config" target="_blank" rel="nofollow noopener noreferrer"><code>.dvc/config</code></a> file. 
In other words, we store the images in an AWS bucket while only keeping references to those files in our Git repository.</p> <h3 id="refactoring-jupyter-code-into-an-ml-pipeline" style="position:relative;">Refactoring Jupyter code into an ML pipeline<a href="#refactoring-jupyter-code-into-an-ml-pipeline" aria-label="refactoring jupyter code into an ml pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Another powerful set of DVC features is ML pipelines. An ML pipeline is a way to codify and automate the workflow used to reproduce a machine learning model. A pipeline consists of a sequence of stages.</p> <p>First, we refactored our Jupyter code into individual, self-contained modules:</p> <ul> <li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/src/stages/data_load.py" target="_blank" rel="nofollow noopener noreferrer"><code>data_load.py</code></a> downloads raw data locally</li> <li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/src/stages/data_split.py" target="_blank" rel="nofollow noopener noreferrer"><code>data_split.py</code></a> splits data into train and test subsets</li> <li><a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/src/stages/train.py" target="_blank" rel="nofollow noopener noreferrer"><code>train.py</code></a> uses the <a href="https://github.com/fastai/fastai" target="_blank" rel="nofollow noopener noreferrer"><code>fastai</code></a> library to train a UNet model with a ResNet-34 encoder and saves it into a pickle file</li> <li><a 
href="https://github.com/iterative/magnetic-tiles-defect/blob/main/src/stages/eval.py" target="_blank" rel="nofollow noopener noreferrer"><code>eval.py</code></a> evaluates the model's performance on the test subset</li> </ul> <p>Specific execution commands, dependencies, and outputs of each stage are defined in the pipeline file <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/dvc.yaml" target="_blank" rel="nofollow noopener noreferrer"><code>dvc.yaml</code></a> (more about pipeline files <a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files" target="_blank" rel="nofollow noopener noreferrer">here</a>).</p> <p>We've also added an optional <a href="https://github.com/iterative/magnetic-tiles-defect/blob/main/dvc.yaml#L2" target="_blank" rel="nofollow noopener noreferrer"><code>check_packages</code></a> stage that freezes the environment into a <code>requirements.txt</code> file containing all Python packages and their versions installed in the environment. We enabled the <a href="https://dvc.org/doc/command-reference/run#--always-changed" target="_blank" rel="nofollow noopener noreferrer"><code>always_changed</code></a> field in the configuration of this stage to ensure DVC reruns this stage every time. All other stages have this text file as a dependency. 
Thus, the entire pipeline will be rerun if anything about our Python environment changes.</p> <p>We can see the whole dependency graph (directed acyclic graph, to be exact) using the <a href="https://dvc.org/doc/command-reference/dag" target="_blank" rel="nofollow noopener noreferrer"><code>dvc dag</code></a> command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc dag</span> </span> +----------------+ | check_packages | *****+----------------+ ***** * ** ** **** ** ** *** *** ** ** *** +-----------+ ** * *** | data_load | ** * * +-----------+ ** * * *** ** * * * ** * * ** * * * +------------+ * * | data_split |*** * * +------------+ *** * * * *** * * * *** * * * ** * * ** +-------+ *** *** | train | *** *** +-------+ *** *** ** *** *** ** *** ** *** +----------+ | evaluate | +----------+</code></pre></div> <p>The entire pipeline can be easily reproduced with the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Running stage 'check_packages':
> python src/stages/check_pkgs.py --config=params.yaml
...
Running stage 'data_load':
> python src/stages/data_load.py --config=params.yaml
...
Running stage 'data_split':
> python src/stages/data_split.py --config=params.yaml
...
Running stage 'train':
> python src/stages/train.py --config=params.yaml
...
Running stage 'evaluate':
> python src/stages/eval.py --config=params.yaml
...</code></pre></div> <h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In this first part of the blog post, we talked about the following:</p> <ul> <li>Common difficulties when building a Computer Vision Web API for defect detection</li> <li>Pros and cons of exploratory work in Jupyter Notebooks</li> <li>Versioning data in remote storage with DVC</li> <li>Moving and refactoring the code from Jupyter Notebooks into DVC pipeline stages</li> </ul> <p>In the second part, we’ll see how to get the most out of experiment tracking and management by seamlessly integrating <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, <a href="https://github.com/iterative/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive</a>, and <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a>.</p>https://dvc.org/blog/april-22-community-gemshttps://dvc.org/blog/april-22-community-gemsThu, 28 Apr 2022 00:00:00 GMT<h3 id="when-i-run-dvc-repro-on-a-stage-does-it-automatically-push-any-outputs-to-my-remote" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/953616587523498025" target="_blank" rel="nofollow noopener noreferrer">When I run <code>dvc repro</code> on a stage, does it automatically push any outputs to my remote?</a><a 
href="#when-i-run-dvc-repro-on-a-stage-does-it-automatically-push-any-outputs-to-my-remote" aria-label="when i run dvc repro on a stage does it automatically push any outputs to my remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Great question from @tina_rey!</p> <p>The <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command doesn't automatically push any outputs or data to your remote. The outputs are stored in the cache until you run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, which then pushes them from your cache to your remote.</p> <h3 id="is-dvc-dag-based-on-deps-and-outs-so-that-a-stage-that-depends-on-the-output-of-another-stage-will-always-be-executed-after-the-former-has-finished" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/956113493155799070" target="_blank" rel="nofollow noopener noreferrer">Is <code>dvc dag</code> based on <code>deps</code> and <code>outs</code>, so that a stage that depends on the output of another stage will always be executed after the former has finished?</a><a href="#is-dvc-dag-based-on-deps-and-outs-so-that-a-stage-that-depends-on-the-output-of-another-stage-will-always-be-executed-after-the-former-has-finished" aria-label="is dvc dag based on deps and outs so that a stage that depends on the output of another stage will always be executed after the former has finished permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 
9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a good question from @johnysku!</p> <p>That is correct! If the pipelines are independent or the stages are independent, they may run in any order. Without explicit dependency linkage, stages could be executed in an unexpected order.</p> <h3 id="if-i-want-to-use-the-foreach-utility-in-dvc-repro-is-there-a-way-i-can-use-glob-patterns-to-create-the-list-dvc-needs-to-iterate-over" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/956241424150577233" target="_blank" rel="nofollow noopener noreferrer">If I want to use the <code>foreach</code> utility in <code>dvc repro</code>, is there a way I can use glob patterns to create the list DVC needs to iterate over?</a><a href="#if-i-want-to-use-the-foreach-utility-in-dvc-repro-is-there-a-way-i-can-use-glob-patterns-to-create-the-list-dvc-needs-to-iterate-over" aria-label="if i want to use the foreach utility in dvc repro is there a way i can use glob patterns to create the list dvc needs to iterate over permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Another interesting question from @copah!</p> <p>If you have <code>mystage</code> which uses <code>foreach</code>, you can do <a 
href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> on <code>mystage</code> to run every stage that <code>mystage</code> generates.</p> <h3 id="how-does-dvc-handle-files-that-have-been-deleted-from-remote-storage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/956254582676258866" target="_blank" rel="nofollow noopener noreferrer">How does DVC handle files that have been deleted from remote storage?</a><a href="#how-does-dvc-handle-files-that-have-been-deleted-from-remote-storage" aria-label="how does dvc handle files that have been deleted from remote storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Really good question from @Meme Philosopher!</p> <p>DVC will fail when you try to pull files that have been deleted from the remote and notify you that those files are missing in remote storage.</p> <h3 id="can-i-separate-cml-running-from-github-actions-vm-to-work-with-gcp-or-aws-so-training-and-testing-are-in-these-cloud-environments" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/954316332457947169" target="_blank" rel="nofollow noopener noreferrer">Can I separate CML running from GitHub actions VM to work with GCP or AWS so training and testing are in these cloud environments?</a><a href="#can-i-separate-cml-running-from-github-actions-vm-to-work-with-gcp-or-aws-so-training-and-testing-are-in-these-cloud-environments" aria-label="can i separate cml running from github actions vm to work with gcp or aws so training
and testing are in these cloud environments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the question @Atsu!</p> <p>This is supported out of the box! Here's how it works:</p> <ol> <li>Within GitHub Actions, CML launches a <a href="https://cml.dev/doc/self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">self-hosted runner</a> on GCP or AWS using <code>cml runner --labels=cml --cloud=gcp</code>/<code>--cloud=aws</code></li> <li>GitHub Actions runs the rest of the workflow on the self-hosted runner using <code>runs-on: [self-hosted, cml]</code> and the maximum allowable <code>timeout-minutes: 4320</code></li> <li>If GitHub Actions is about to time out, CML will restart the workflow, so make sure your code regularly caches and restores data if it's expected to take more than 3 days to run.</li> </ol> <p>You can follow along with <a href="https://cml.dev/doc/self-hosted-runners?tab=GitHub#allocating-cloud-compute-resources-with-cml" target="_blank" rel="nofollow noopener noreferrer">this doc</a> to get started.</p> <p>The key is requesting GitHub's <a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#usage-limits" target="_blank" rel="nofollow noopener noreferrer">maximum <code>timeout-minutes: 4320</code></a>. This signals to CML to <a href="https://cml.dev/doc/ref/runner#faqs-and-known-issues" target="_blank" rel="nofollow noopener noreferrer">restart the workflow</a> just before the timeout.
You'll also have to write your code to cache results so that the restarted workflow will use previous results (e.g. use <a href="https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints</a> and <a href="https://github.com/iterative/dvc/issues/6823" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/dvc/issues/6823</a>)</p> <h3 id="when-running-an-experiment-from-the-web-interface-with-dvc-is-there-any-way-to-get-the-new-metrics-to-show-on-the-commit-created-by-iterative-studio-for-the-experiment" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/957931058639306772" target="_blank" rel="nofollow noopener noreferrer">When running an experiment from the web interface with DVC, is there any way to get the new metrics to show on the commit created by Iterative Studio for the experiment?</a><a href="#when-running-an-experiment-from-the-web-interface-with-dvc-is-there-any-way-to-get-the-new-metrics-to-show-on-the-commit-created-by-iterative-studio-for-the-experiment" aria-label="when running an experiment from the web interface with dvc is there any way to get the new metrics to show on the commit created by iterative studio for the experiment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Awesome question about Studio from @Benjamin-Etheredge!</p> <p>In order to show the experiment results in Studio, you would 
have to commit and push the results as part of your CI (continuous integration) action. Here's an <a href="https://github.com/iterative/demo-fashion-mnist/blob/main/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">example GitHub action script</a> that does this.</p> <p>We do understand that it is not ideal that there are 2 commits, one with your changes and one with the results. We have been thinking about how this can be improved and it would be great to hear if you have <a href="https://github.com/iterative/studio-support/" target="_blank" rel="nofollow noopener noreferrer">any thoughts/ideas</a>!</p> <h3 id="is-there-a-way-to-get-dvc-to-import-from-a-private-repository" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/964204106824695868" target="_blank" rel="nofollow noopener noreferrer">Is there a way to get DVC to import from a private repository?</a><a href="#is-there-a-way-to-get-dvc-to-import-from-a-private-repository" aria-label="is there a way to get dvc to import from a private repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Good question from @qubvel!</p> <p>You can use SSH to handle this and run the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> [email protected]:<span class="token operator">&lt;</span>repository location<span class="token operator">&gt;</span> <span
class="token operator"><</span>data_path<span class="token operator">></span></span></code></pre></div> <h3 id="if-i-use-a-local-remote-and-a-shared-cache-will-the-data-be-symlinked-from-the-remote-to-the-cache" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/963768504987815987" target="_blank" rel="nofollow noopener noreferrer">If I use a local remote and a shared cache, will the data be symlinked from the remote to the cache?</a><a href="#if-i-use-a-local-remote-and-a-shared-cache-will-the-data-be-symlinked-from-the-remote-to-the-cache" aria-label="if i use a local remote and a shared cache will the data be symlinked from the remote to the cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Very interesting question from @cajoek!</p> <p>The data will <em>not</em> be symlinked from the remote to the cache.</p> <p>The cache can be treated as temporary storage: data from failed experiments and other runs that will never be used again tends to accumulate there. In that case, a local remote is a good place to keep the data for the important versions of your project.</p> <p>That way, when your cache grows too big and the project takes up too much space, you can remove <code>.dvc/cache</code> and download the latest important version from the remote.</p> <hr> <p><img src="https://media.giphy.com/media/f8QPB1rgHbwhcD2Jd6/giphy.gif" alt="iAM_Learning GIF"></p> <p>At our May Office Hours Meetup, Matt Squire of Fuzzy Labs will join us to share his views on open source MLOps tools!
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/285550813" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/terraform-providerhttps://dvc.org/blog/terraform-providerWed, 27 Apr 2022 00:00:00 GMT<p>The requirements for Machine Learning (ML) infrastructure are becoming increasingly complex. Training large models often requires specialized hardware (GPUs, TPUs), which involves moving the whole training process onto cloud machines, setting up environments and synchronizing data. For teams that want to leverage spot instances, the setup becomes even more complex: they need to make sure the training progress is not lost during spot instance recovery. This is time-consuming and requires expertise in both DevOps and Machine Learning. Additionally, training in a cloud environment can incur high costs due to the need for expensive hardware, as well as users forgetting to shut down instances when training is complete.</p> <p>To address the specific needs of machine learning teams, we have built <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">Terraform Provider Iterative (TPI)</a>. TPI is an open-source tool extending the functionality of Terraform, the world's most widely used multi-cloud provisioning product.
The Iterative Provider enables full lifecycle management of computing resources and is designed specifically for machine learning pipelines.</p> <h2 id="tailored-to-machine-learning-workflows" style="position:relative;">Tailored to Machine Learning Workflows<a href="#tailored-to-machine-learning-workflows" aria-label="tailored to machine learning workflows permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The Iterative Provider offers a single resource called <code>iterative_task</code> which you can use to configure:</p> <ul> <li>Your cloud infrastructure</li> <li>The steps to perform on the cloud resource, i.e. setting up the environment, running the training pipeline, logging metrics, etc.</li> <li>The data to be synced back once the training is complete (e.g. 
a file with metrics, a model, plots)</li> </ul> <p>Here’s a “hello world” example of a <code>main.tf</code> Terraform configuration file using the <code>iterative_task</code> resource:</p> <div class="gatsby-highlight" data-language="hcl"><pre class="language-hcl"><code class="language-hcl"><span class="token keyword">terraform</span> <span class="token punctuation">{</span>
  <span class="token keyword">required_providers</span> <span class="token punctuation">{</span>
    <span class="token property">iterative</span> <span class="token punctuation">=</span> <span class="token punctuation">{</span>
      <span class="token property">source</span> <span class="token punctuation">=</span> <span class="token string">"iterative/iterative"</span>
    <span class="token punctuation">}</span>
  <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
<span class="token keyword">provider<span class="token type variable"> "iterative" </span></span><span class="token punctuation">{</span><span class="token punctuation">}</span>
<span class="token keyword">resource <span class="token type variable">"iterative_task"</span></span> <span class="token string">"example"</span> <span class="token punctuation">{</span>
  <span class="token property">cloud</span> <span class="token punctuation">=</span> <span class="token string">"aws"</span> <span class="token comment"># or any of: gcp, az, k8s</span>
  <span class="token property">machine</span> <span class="token punctuation">=</span> <span class="token string">"m"</span> <span class="token comment"># medium. Or any of: l, xl, m+k80, xl+v100, ...</span>
  <span class="token property">image</span> <span class="token punctuation">=</span> <span class="token string">"ubuntu"</span> <span class="token comment"># or "nvidia", ...</span>
  <span class="token property">region</span> <span class="token punctuation">=</span> <span class="token string">"us-west"</span> <span class="token comment"># or us-west, eu-east, ...</span>
  <span class="token property">disk_size</span> <span class="token punctuation">=</span> <span class="token number">30</span> <span class="token comment"># GB</span>
  <span class="token property">spot</span> <span class="token punctuation">=</span> <span class="token number">0</span> <span class="token comment"># auto-price. Default -1 to disable or &gt;0 for hourly USD limit</span>
  <span class="token property">timeout</span> <span class="token punctuation">=</span> <span class="token number">24</span>*<span class="token number">60</span>*<span class="token number">60</span> <span class="token comment"># max 24h before forced termination</span>

  <span class="token keyword">storage</span> <span class="token punctuation">{</span>
    <span class="token property">workdir</span> <span class="token punctuation">=</span> <span class="token string">"."</span>
    <span class="token property">output</span> <span class="token punctuation">=</span> <span class="token string">"results"</span>
  <span class="token punctuation">}</span>
  <span class="token property">script</span> <span class="token punctuation">=</span> <span class="token heredoc string">&lt;&lt;-END
    #!/bin/bash
    sudo apt update
    sudo apt install -y python3-pip
    pip3 install --user -r requirements.txt
    python3 train.py
  END</span>
<span class="token punctuation">}</span></code></pre></div> <p>Once the training is complete, the Iterative Provider terminates the resource, so users don't have to worry about spiraling costs from unused machines.</p> <h2 id="configure-once-bring-everywhere" style="position:relative;">Configure Once,
Bring Everywhere<a href="#configure-once-bring-everywhere" aria-label="configure once bring everywhere permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Once you configure infrastructure and a script that executes your training pipeline in a Terraform configuration file, you can bring that pipeline anywhere you want. You can use such a config for ad-hoc training at any stage of your prototyping process or use it as a job in your preferred CI/CD tool. You can also store your infrastructure configuration files in a version control system together with the rest of your project for easier control.</p> <h2 id="one-provider-to-rule-them-all" style="position:relative;">One Provider to Rule Them All<a href="#one-provider-to-rule-them-all" aria-label="one provider to rule them all permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Whether you prefer Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), or Kubernetes (K8s), the Iterative Provider has you covered. 
You can configure compute resources from these with a unified API, using <a href="https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type" target="_blank" rel="nofollow noopener noreferrer">common machine types</a> that are the same across all cloud vendors. This significantly simplifies infrastructure configuration and makes it easy to migrate from one cloud to another by changing just one line of code.</p> <h2 id="costs-optimization" style="position:relative;">Costs Optimization<a href="#costs-optimization" aria-label="costs optimization permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The Iterative Provider helps with cloud compute cost optimization in two major ways. First, upon completion of your script, the instance is automatically terminated. This helps to avoid accumulating costs due to abandoned resources. Second, you can leverage the cost-saving power of spot instances to train your models without losing any progress! 
TPI recovers the working directory and respawns interrupted/preempted instances for you.</p> <h2 id="devops-friendly" style="position:relative;">DevOps-Friendly<a href="#devops-friendly" aria-label="devops friendly permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Last, but not least, the Iterative Provider aims to bridge the gap between DevOps and Data Science teams. We build on top of Terraform, a tool universally familiar to DevOps teams, but extend it to suit ML needs.</p> <p>If you’d like to try the Iterative Provider in your project, check out the documentation on the provider’s page in the Terraform registry, and if you have any questions or suggestions, we welcome them in our <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">GitHub repository.</a></p>https://dvc.org/blog/CML-runners-saving-models-1https://dvc.org/blog/CML-runners-saving-models-1Tue, 26 Apr 2022 00:00:00 GMT<p>When you first develop a machine learning model, you will probably do so on your local machine. You can easily change algorithms, parameters, and input data right in your text editor, notebook, or terminal. Imagine you have a long-running model for which you want to detect possible <a href="https://en.wikipedia.org/wiki/Concept_drift" target="_blank" rel="nofollow noopener noreferrer">drift</a>, however. 
In that case it would be beneficial to automatically retrain your model on a regular basis.</p> <p>In this guide, we will show how you can use <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML (Continuous Machine Learning)</a> to do just that. CML is an open-source library for implementing continuous integration and delivery (CI/CD) in machine learning projects. This way we can define a pipeline to train a model and keep track of various versions. Although we could do so directly in our CI/CD pipeline (e.g. GitHub Actions Workflows), the runners used for this generally don’t have a lot of processing power. Therefore, it makes more sense to provision a dedicated runner that is tailored to our computing needs.</p> <p>At the end of this guide we will have set up a CML workflow that does the following on a daily basis:</p> <ol> <li>Provision an Amazon Web Services (AWS) EC2 instance</li> <li>Train the model</li> <li>Save the model and its metrics to a GitHub repository</li> <li>Create a pull request with the new outputs</li> <li>Terminate the AWS EC2 instance</li> </ol> <p>In a follow-up post we will expand upon this by using <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> to designate remote storage for our resulting models. But let's focus on CML first!</p> <p>All files needed for this guide can be found in <a href="https://github.com/iterative/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">this repository</a>.</p> <admon type="info"> <p>This guide can be followed on its own, but also as an extension to this <a href="https://cml.dev/doc/self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">example in the docs</a>.</p> </admon> <admon type="tip"> <p>We will be using GitHub for our CI/CD and AWS for our computing resources.
With slight modifications, however, you can use <a href="https://cml.dev/doc/self-hosted-runners?tab=GitLab#allocating-cloud-compute-resources-with-cml" target="_blank" rel="nofollow noopener noreferrer">GitLab CI/CD</a>, <a href="https://cml.dev/doc/self-hosted-runners?tab=GCP#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">Google Cloud</a> or <a href="https://cml.dev/doc/self-hosted-runners?tab=Azure#cloud-compute-resource-credentials" target="_blank" rel="nofollow noopener noreferrer">Microsoft Azure</a>.</p> </admon> <h1 id="prerequisites" style="position:relative;">Prerequisites<a href="#prerequisites" aria-label="prerequisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Before we begin, make sure you have the following things set up:</p> <ol> <li>You have <a href="https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/" target="_blank" rel="nofollow noopener noreferrer">created an AWS account</a> (free tier suffices)</li> <li>You have <a href="https://cml.dev/doc/self-hosted-runners?tab=GitHub#personal-access-token" target="_blank" rel="nofollow noopener noreferrer">created a <code>PERSONAL_ACCESS_TOKEN</code> on GitHub</a> with the <code>repo</code> scope</li> <li>You have <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-creds" target="_blank" rel="nofollow noopener noreferrer">created an <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> on AWS</a></li> <li>You have <a 
href="https://docs.github.com/en/actions/security-guides/encrypted-secrets" target="_blank" rel="nofollow noopener noreferrer">added the <code>PERSONAL_ACCESS_TOKEN</code>, <code>AWS_ACCESS_KEY_ID</code>, and <code>AWS_SECRET_ACCESS_KEY</code> as GitHub secrets</a></li> </ol> <p>It also helps to clone <a href="https://github.com/iterative/example_model_export_cml" target="_blank" rel="nofollow noopener noreferrer">the template repository for this tutorial</a>.</p> <h1 id="training-a-model-and-saving-it" style="position:relative;">Training a model and saving it<a href="#training-a-model-and-saving-it" aria-label="training a model and saving it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>To kick off, we will adapt <code>train.py</code> from the <a href="https://cml.dev/doc/start/github" target="_blank" rel="nofollow noopener noreferrer">CML getting started guide</a>. Here we create a simple <code>RandomForestClassifier()</code> based on some generated data. We then use the model to make some predictions and plot those predictions in a confusion matrix.</p> <p>While the script runs, the model is kept only in memory, meaning it is discarded as soon as the script finishes. In order to save the model for later, we need to dump it as a binary file. We do so with <a href="https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html" target="_blank" rel="nofollow noopener noreferrer"><code>joblib.dump()</code></a>.
Later we can read the model using <a href="https://joblib.readthedocs.io/en/latest/generated/joblib.load.html" target="_blank" rel="nofollow noopener noreferrer"><code>joblib.load()</code></a> when we need to.</p> <admon type="tip"> <p>You can also use <code>pickle.dump()</code> if you prefer.</p> </admon> <p>The outputs of <code>train.py</code> are:</p> <ul> <li><code>metrics.txt</code>: a file containing metrics on model performance (in this case accuracy)</li> <li><code>confusion_matrix.png</code>: a plot showing the classification results of our model</li> <li><code>random_forest.joblib</code>: the binary output of the trained model</li> </ul> <p>All of these files are saved to the <code>model</code> directory.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> json
<span class="token keyword">import</span> os

<span class="token keyword">import</span> joblib
<span class="token keyword">import</span> matplotlib<span class="token punctuation">.</span>pyplot <span class="token keyword">as</span> plt
<span class="token keyword">import</span> numpy <span class="token keyword">as</span> np
<span class="token keyword">from</span> sklearn<span class="token punctuation">.</span>ensemble <span class="token keyword">import</span> RandomForestClassifier
<span class="token keyword">from</span> sklearn<span class="token punctuation">.</span>metrics <span class="token keyword">import</span> ConfusionMatrixDisplay

<span class="token comment"># Read in data</span>
X_train <span class="token operator">=</span> np<span class="token punctuation">.</span>genfromtxt<span class="token punctuation">(</span><span class="token string">"data/train_features.csv"</span><span class="token punctuation">)</span>
y_train <span class="token operator">=</span> np<span class="token punctuation">.</span>genfromtxt<span class="token punctuation">(</span><span class="token string">"data/train_labels.csv"</span><span class="token punctuation">)</span>
X_test <span class="token operator">=</span> np<span class="token punctuation">.</span>genfromtxt<span class="token punctuation">(</span><span class="token string">"data/test_features.csv"</span><span class="token punctuation">)</span>
y_test <span class="token operator">=</span> np<span class="token punctuation">.</span>genfromtxt<span class="token punctuation">(</span><span class="token string">"data/test_labels.csv"</span><span class="token punctuation">)</span>

<span class="token comment"># Fit a model</span>
depth <span class="token operator">=</span> <span class="token number">5</span>
clf <span class="token operator">=</span> RandomForestClassifier<span class="token punctuation">(</span>max_depth<span class="token operator">=</span>depth<span class="token punctuation">)</span>
clf<span class="token punctuation">.</span>fit<span class="token punctuation">(</span>X_train<span class="token punctuation">,</span> y_train<span class="token punctuation">)</span>

<span class="token comment"># Calculate accuracy</span>
acc <span class="token operator">=</span> clf<span class="token punctuation">.</span>score<span class="token punctuation">(</span>X_test<span class="token punctuation">,</span> y_test<span class="token punctuation">)</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>acc<span class="token punctuation">)</span>

<span class="token comment"># Create model folder if it does not yet exist</span>
<span class="token keyword">if</span> <span class="token keyword">not</span> os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>exists<span class="token punctuation">(</span><span class="token string">"model"</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    os<span class="token punctuation">.</span>makedirs<span class="token punctuation">(</span><span class="token string">"model"</span><span class="token punctuation">)</span>

<span class="token comment"># Write metrics to file</span>
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"model/metrics.txt"</span><span class="token punctuation">,</span> <span class="token string">"w+"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> outfile<span class="token punctuation">:</span>
    outfile<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">"Accuracy: "</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>acc<span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token string">"\n"</span><span class="token punctuation">)</span>

<span class="token comment"># Plot confusion matrix (ConfusionMatrixDisplay.from_estimator requires scikit-learn 1.0 or later)</span>
disp <span class="token operator">=</span> ConfusionMatrixDisplay<span class="token punctuation">.</span>from_estimator<span class="token punctuation">(</span>clf<span class="token punctuation">,</span> X_test<span class="token punctuation">,</span> y_test<span class="token punctuation">,</span> normalize<span class="token operator">=</span><span class="token string">"true"</span><span class="token punctuation">,</span> cmap<span class="token operator">=</span>plt<span class="token punctuation">.</span>cm<span class="token punctuation">.</span>Blues<span class="token punctuation">)</span>
plt<span class="token punctuation">.</span>savefig<span class="token punctuation">(</span><span class="token string">"model/confusion_matrix.png"</span><span class="token punctuation">)</span>

<span class="token comment"># Save the model</span>
joblib<span class="token punctuation">.</span>dump<span class="token punctuation">(</span>clf<span class="token punctuation">,</span> <span class="token string">"model/random_forest.joblib"</span><span class="token
punctuation">)</span></code></pre></div> <h1 id="train-the-model-on-a-daily-basis" style="position:relative;">Train the model on a daily basis<a href="#train-the-model-on-a-daily-basis" aria-label="train the model on a daily basis permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Now that we have a script to train our model and save it as a file, let’s set up our CI/CD to provision a runner and run the script. We define our workflow in <code>cml.yaml</code> and save it in the <code>.github/workflows</code> directory. This way GitHub will automatically run the workflow whenever it is triggered. In this case there are two triggers: a manual request and an automatic daily schedule.</p> <admon type="info"> <p>The name of the workflow file doesn’t matter, as long as it’s a <code>.yaml</code> file located in the <code>.github/workflows</code> directory. You can have multiple workflows in there as well.
You can learn more in the <a href="https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions" target="_blank" rel="nofollow noopener noreferrer">documentation</a>.</p> </admon> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> CML
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token comment"># Here we use two triggers: manually and daily at 08:00 UTC</span>
  <span class="token key atrule">workflow_dispatch</span><span class="token punctuation">:</span>
  <span class="token key atrule">schedule</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">cron</span><span class="token punctuation">:</span> <span class="token string">'0 8 * * *'</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">deploy-runner</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Deploy runner on EC2
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          cml runner \
            --cloud=aws \
            --cloud-region=eu-west \
            --cloud-type=t2.micro \
            --labels=cml-runner \
            --single</span>
  <span class="token key atrule">train-model</span><span class="token punctuation">:</span>
    <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
    <span class="token key atrule">timeout-minutes</span><span class="token punctuation">:</span> <span class="token number">120</span> <span class="token comment"># 2h</span>
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>node@v3
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">node-version</span><span class="token punctuation">:</span> <span class="token string">'16'</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Train model
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          cml ci
          pip install -r requirements.txt
          python get_data.py
          python train.py</span></code></pre></div> <admon type="warn"> <p>In this example we are using a <code>t2.micro</code> <a href="https://aws.amazon.com/ec2/instance-types/" target="_blank" rel="nofollow noopener noreferrer">AWS EC2 instance</a>. At the time of writing this is included in the AWS free tier.
Make sure that you qualify for this free usage to prevent unexpected spending. When you specify a bulkier <code>cloud-type</code>, your expenses will rise.</p> </admon> <p>The workflow we defined first <a href="https://cml.dev/doc/ref/runner" target="_blank" rel="nofollow noopener noreferrer">provisions a runner</a> on AWS, and then uses that runner to train the model. After completing the training job, CML automatically terminates the runner to prevent you from incurring further costs. Once the runner is terminated, however, the model is lost along with it. Let's see how we can save our model in the next step!</p> <h1 id="export-the-model-to-our-git-repository" style="position:relative;">Export the model to our Git repository<a href="#export-the-model-to-our-git-repository" aria-label="export the model to our git repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>CML allows us to export the model from our runner to our Git repository. Let's extend the training stage of our workflow by pushing <code>random_forest.joblib</code> to a new experiment branch and creating a pull request.</p> <p><a href="https://cml.dev/doc/ref/pr" target="_blank" rel="nofollow noopener noreferrer"><code>cml pr</code></a> is the command that specifies which files should be included in the pull request. 
The commands after that are used to generate a report in the pull request that displays the confusion matrix and calculated metrics.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train-model</span><span class="token punctuation">:</span>
  <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner
  <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span>
  <span class="token key atrule">timeout-minutes</span><span class="token punctuation">:</span> <span class="token number">120</span> <span class="token comment"># 2h</span>
  <span class="token key atrule">steps</span><span class="token punctuation">:</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2
      <span class="token key atrule">with</span><span class="token punctuation">:</span>
        <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>node@v3
      <span class="token key atrule">with</span><span class="token punctuation">:</span>
        <span class="token key atrule">node-version</span><span class="token punctuation">:</span> <span class="token string">'16'</span>
    <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1
    <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Train model
      <span class="token key atrule">env</span><span class="token punctuation">:</span>
        <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
      <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
        cml ci
        pip install -r requirements.txt
        python get_data.py
        python train.py</span>

        <span class="token comment"># Create pull request</span>
        cml pr model/random_forest.joblib

        <span class="token comment"># Create CML report</span>
        cat model/metrics.txt <span class="token punctuation">></span> report.md
        cml publish model/confusion_matrix.png <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md
        cml send<span class="token punctuation">-</span>comment <span class="token punctuation">-</span><span class="token punctuation">-</span>pr <span class="token punctuation">-</span><span class="token punctuation">-</span>update report.md</code></pre></div> <p>Et voilà! We are now running a daily model training on an AWS EC2 instance and saving the resulting model to our GitHub repository.</p> <p>There is still some room for improvement, though. This approach works well when our resulting model is small (less than 100MB), but we wouldn't want to store large models in our Git repository.
In a follow-up post we will describe how we can use <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, another Iterative open-source tool, for storage when we're dealing with larger files.</p> <h1 id="conclusions" style="position:relative;">Conclusions<a href="#conclusions" aria-label="conclusions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>There are many cases in which it's a good idea to retrain models periodically. For example, you could be using the latest data available to you in order to prevent model drift. CML allows you to automate this process.</p> <p>In this guide, we explored how to set up CML for a daily training job using a self-hosted runner. We automatically provisioned this runner on AWS, exported the resulting files to our Git repository, and terminated the runner to prevent racking up our AWS bill.</p> <p>In a follow-up post we will explore how to use DVC when the resulting model is too large to store directly in our Git repository.</p> <p>Another great extension of our CI/CD would be a <code>deploy</code> step to bring the latest version of our model into production. This step might be conditional on the performance of the model; we could decide to only start using it in production if it performs better than previous iterations. All of this warrants a guide of its own, however, so look out for that in the future! 
😉</p>https://dvc.org/blog/april-22-heartbeathttps://dvc.org/blog/april-22-heartbeatFri, 15 Apr 2022 00:00:00 GMT<details> <p>This month's Heartbeat image is inspired by Community member Gudmundur Heimisson. Gudmundur submitted some great PRs to update WebHDFS docs pending some other issues in the DVC repo.</p> <p>This image reflects his Paris area team's view of Château de Vincennes from their company windows!</p> <p>We are grateful for all our Community members' contributions from all around the world!</p> <summary>✨Image Inspo✨</summary> </details> <p>Welcome to April! We have lots to ingest from the AI World and the Community so let's get started with all the building blocks for success!</p> <h2 id="ai-news" style="position:relative;">AI News<a href="#ai-news" aria-label="ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><img src="https://media.giphy.com/media/l0JMrPWRQkTeg3jjO/giphy.gif" alt="Lego Rotate GIF by sheepfilms"></p> <h3 id="the-future-of-ai-infrastructure-is-becoming-modular-why-best-of-breed-mlops-solutions-are-taking-off--top-players-to-watch" style="position:relative;">The Future of AI Infrastructure is Becoming Modular: Why Best-of-Breed MLOps Solutions are Taking Off & Top Players to Watch<a href="#the-future-of-ai-infrastructure-is-becoming-modular-why-best-of-breed-mlops-solutions-are-taking-off--top-players-to-watch" aria-label="the future of ai infrastructure is becoming modular why best of breed mlops solutions are taking off top players to watch permalink" class="anchor after"><svg aria-hidden="true" height="16"
width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/CasberW" target="_blank" rel="nofollow noopener noreferrer"><strong>Casber Wang</strong></a> of <a href="https://twitter.com/SapphireVC" target="_blank" rel="nofollow noopener noreferrer">Sapphire VC</a> recently wrote <a href="https://medium.com/sapphire-ventures-perspectives/the-future-of-ai-infrastructure-is-becoming-modular-why-best-of-breed-mlops-solutions-are-taking-fd85c6ca8bcf" target="_blank" rel="nofollow noopener noreferrer">a piece on Medium</a> on the necessary trend of AI infrastructure tooling becoming modular. He identifies three types of AI users: "Off-the-shelfers," "Bet-the-Farmers," and "Rocket Scientists."
As the industry matures he makes the case (and we concur) for the need for modular infrastructure tooling to provide AI teams with the most flexible approach as they fine-tune their advancing and ever-growing processes.</p> <blockquote> <p>Where organizations used to seek all-in-one solutions to operationalize machine learning (ML) due to limited in-house resources and expertise, we’re seeing a rise in the demand for modular, best-in-class tooling that equips today’s more robust ML teams with the ability to flexibly run highly-custom and performant ML workloads.</p> </blockquote> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9e3d9a1bac94ef897a27f78d1c41d8a7/39600/clayton-christensen.png" alt="Clayton Christensen's Modularity Theory" title="Clayton Christensen's Modularity Theory" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Clayton Christensen's Modularity Theory (<a href="https://medium.com/sapphire-ventures-perspectives/the-future-of-ai-infrastructure-is-becoming-modular-why-best-of-breed-mlops-solutions-are-taking-fd85c6ca8bcf" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <blockquote> <p>Soon, large data teams will turn to modular toolkits with dozens of solutions that manage different stages of the AI lifecycle. This will be particularly true of the “bet-the-farmers”, who will need customized, best-in-class tools that provide the flexibility that can match their exact challenge.</p> </blockquote> <p>Wang describes the different toolchain groupings in the AI Lifecycle and discusses some of the players in each of them. 
DVC shows up in the Model Evaluation & Experiment Tracking group, but soon you will see that our tools deliver flexible, modular building blocks for some other pieces of the puzzle.</p> <h2 id="data-distribution-shifts-and-monitoring" style="position:relative;">Data Distribution Shifts and Monitoring<a href="#data-distribution-shifts-and-monitoring" aria-label="data distribution shifts and monitoring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/chipro" target="_blank" rel="nofollow noopener noreferrer"><strong>Chip Huyen's</strong></a> <a href="https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html" target="_blank" rel="nofollow noopener noreferrer">most recent blog post</a> created for the course at Stanford <a href="https://cs329s.stanford.edu/" target="_blank" rel="nofollow noopener noreferrer">CS 329S: Machine Learning Systems Design</a> goes into detail on all things related to data distribution shifts and the monitoring of them. The piece provides great examples to understand concepts such as natural labels, the types of distribution shifts, causes of ML System failure, and the metrics needed to monitor these things to determine when your model is no longer producing the desired results. She discusses tools that can help identify these shifts including logs, dashboards, and alerts, acknowledging the pluses and minuses of each approach. 
Finally, she explains why the term <em>observability</em> is increasingly favored over <em>monitoring</em>: it is a stronger concept that covers determining what went wrong with the internal states of a system by observing its external outputs.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b0ee0c27cf955adeab2a7e0b2e35c49c/39600/chip-huyen.png" alt="Drift Detection Algorithms" title="Drift Detection Algorithms" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Drift detection algorithms by open-source package alibi-detect (<a href="https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html#monitoring-toolbox" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>Related to this, for a tutorial on how to detect drift and correct your model with <a href="https://evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">Evidently AI</a> and DVC, see <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor's</strong></a> latest post on <a href="https://dvc.org/blog/stale-models" target="_blank" rel="nofollow noopener noreferrer">Preventing Stale Models in Production!</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7abf9ed6a309e5b6310932fc37cd4777/39600/stale-model-cover.png" alt="Preventing Stale Models in Production" title="Preventing Stale Models in Production" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Preventing Stale Models in Production (<a href="https://dvc.org/blog/stale-models" target="_blank" rel="nofollow noopener
noreferrer">Source link</a>)</em></p> <h3 id="mlops-is-the-solution-for-machine-learning-and-ai-projects" style="position:relative;">MLOps is the Solution for Machine Learning and AI Projects<a href="#mlops-is-the-solution-for-machine-learning-and-ai-projects" aria-label="mlops is the solution for machine learning and ai projects permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The team at <a href="https://xpresso.ai" target="_blank" rel="nofollow noopener noreferrer"><strong>xpresso.ai</strong></a> created <a href="https://xpresso.ai/resources/blogs/mlops-is-the-solution-for-machine-learning-and-ai-projects/?utm_source=rss&utm_medium=rss&utm_campaign=mlops-is-the-solution-for-machine-learning-and-ai-projects" target="_blank" rel="nofollow noopener noreferrer">this short post</a> about all the facets that make up MLOps. While the tried and true <a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining" target="_blank" rel="nofollow noopener noreferrer">CRISP-DM</a> model for Data Science takes us right up to production, MLOps encompasses considerably more processes that keep and maintain a model in production over time. 
You can see all of these things highlighted in their image below, providing lots to ponder!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/75f5d2da64a689d6f93dbe57c92a3e97/03346/Machine-Learning-Operations.jpg" alt="Machine Learning Operations" title="Machine Learning Operations" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Machine Learning Operations (<a href="https://xpresso.ai/resources/blogs/mlops-is-the-solution-for-machine-learning-and-ai-projects/?utm_source=rss&utm_medium=rss&utm_campaign=mlops-is-the-solution-for-machine-learning-and-ai-projects" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="community-news" style="position:relative;">Community News<a href="#community-news" aria-label="community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="kaushik-shakkari-the-three-environments-for-ai-professionals--research-development-and-production" style="position:relative;">Kaushik Shakkari: The three environments for AI Professionals — Research, Development, and Production<a href="#kaushik-shakkari-the-three-environments-for-ai-professionals--research-development-and-production" aria-label="kaushik shakkari the three environments for ai professionals research development and production permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 
9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b59a0f18177f3039887a5e1efa21fe23/39600/kaushik-shakkari.png" alt="The three environments for AI Professionals - Research, Development, and Production" title="The three environments for AI Professional - Research, Development, and Production =" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> If your head is spinning from the many facets of the MLOps world outlined in xpresso.ai's diagram above, and you are wondering where you fit in it or in the AI world in general, <a href="https://www.linkedin.com/in/kaushik-shakkari/" target="_blank" rel="nofollow noopener noreferrer"><strong>Kaushik Shakkari</strong></a> wrote <a href="https://kaushikshakkari.medium.com/the-three-environments-for-ai-professionals-research-development-and-production-cffb86dfe533" target="_blank" rel="nofollow noopener noreferrer">this article</a> dividing the AI space into three environments: Research, Development, and Production. He goes into detail about the type of work, skillsets, and roles found in each.
This breakdown can help readers zero in on where they may best fit and be fulfilled in this vast and often confusing space, as well as determine a pathway for their career.</p> <h3 id="yashaswi-nayak-continuous-machine-learning---an-introduction-to-cml-iterativeai" style="position:relative;">Yashaswi Nayak: Continuous Machine Learning - An Introduction to CML (Iterative.ai)<a href="#yashaswi-nayak-continuous-machine-learning---an-introduction-to-cml-iterativeai" aria-label="yashaswi nayak continuous machine learning an introduction to cml iterativeai permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ed480cafde4f667400e7defddc2f3400/39600/yashaswi-nayak.png" alt="Continuous Machine Learning - An Introduction to CML" title="Continuous Machine Learning - An Introduction to CML =" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="https://twitter.com/YashaswiNayak" target="_blank" rel="nofollow noopener noreferrer"><strong>Yashaswi Nayak</strong></a> writes <a href="https://towardsdatascience.com/continuous-machine-learning-e1ffb847b8da" target="_blank" rel="nofollow noopener noreferrer">a wonderful guide</a> for data scientists and engineers, filled with great storytelling and fun images created by the author, about using <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> to provide CI/CD to ML
projects. He discusses the usual software development cycle using Git and then follows with the complexities introduced by ML projects. He explains why CML is needed in the ML space and how it works.</p> <p>Yashaswi gives the scenario of a team working on a classifier problem and how CML would work for different team members tackling different parts of the problem. He details all the questions a CML.yml file answers and takes care of in the workflow. Finally, he lists a number of use cases for readers to try out with CML. We'd love to see some Community members write about some of these use cases that they've put into action!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a71f83c3de3d73e01b53560003789e21/03346/cml-workflow.jpg" alt="Continuous Machine Learning" title="Continuous Machine Learning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>CML workflow (<a href="https://towardsdatascience.com/continuous-machine-learning-e1ffb847b8da" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="zoumana-keita-mlops--data-and-model-versioning-with-dvc-and-azure-blob-storage" style="position:relative;">Zoumana Keita: MLops — Data And Model Versioning With DVC and Azure Blob Storage<a href="#zoumana-keita-mlops--data-and-model-versioning-with-dvc-and-azure-blob-storage" aria-label="zoumana keita mlops data and model versioning with dvc and azure blob storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2
1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you've ever struggled with setting up your Azure Blob Storage with DVC, or you know you will need to in the near future, you're in luck! <a href="https://twitter.com/zoumana_keita_" target="_blank" rel="nofollow noopener noreferrer"><strong>Zoumana Keita</strong></a> shows you how to do just that <a href="https://towardsdatascience.com/large-data-versioning-with-dvc-and-azure-blob-storage-a-complete-guide-b97344827c81" target="_blank" rel="nofollow noopener noreferrer">in this post</a> in <a href="https://towardsdatascience.com" target="_blank" rel="nofollow noopener noreferrer">Towards Data Science.</a> He was recently struggling with the same problem, and team member <a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer">David de la Iglesia Castro</a> came to the rescue on our <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord Server.</a> Zoumana was kind enough to write a blog article on the detailed steps for the benefit of the Community.</p> <p>At this point in this Heartbeat, you probably grasp the importance of data, model, and experiment versioning and how DVC easily versions large files in conjunction with Git, which Zoumana describes. He then takes you on a detailed journey, with screenshots, through all the steps to get DVC set up with Azure Blob Storage. Many thanks for this tutorial! 
🙏🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/large-data-versioning-with-dvc-and-azure-blob-storage-a-complete-guide-b97344827c81" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">MLOps — Data And Model Versioning With DVC And Azure Blob Storage</h4> <div class="elp-description">Zoumana Keita's detailed tutorial on how to set up Azure Blob Storage with DVC</div> <div class="elp-link">https://towardsdatascience.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-04-15/zoumana-keita-19150abaef96d64b94afb3d616881d45.png" alt="MLOps — Data And Model Versioning With DVC And Azure Blob Storage"> </div> </a> </section> <p></p> <h3 id="ahmed-abdullah-perfect-way-of-versioning-models--training-data" style="position:relative;">Ahmed Abdullah: Perfect Way of Versioning Models & Training Data<a href="#ahmed-abdullah-perfect-way-of-versioning-models--training-data" aria-label="ahmed abdullah perfect way of versioning models training data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/ahmed-abdullah-7b1806180/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ahmed Abdullah</strong></a> <a href="https://medium.com/red-buffer/perfect-way-of-versioning-models-training-data-318819a1510d" target="_blank" rel="nofollow noopener noreferrer">wrote this tutorial</a> in Medium about how to get DVC set up to version your data and models with a Google Drive. 
He takes you in detail through the steps and discusses many of the reasons why this versioning is important to your success as an ML engineer including ever-changing data, effective collaboration with teammates, and the need for keeping data separated from code for security reasons.</p> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/red-buffer/perfect-way-of-versioning-models-training-data-318819a1510d" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Perfect Way of Versioning Models & Training Data</h4> <div class="elp-description">Ahmed Abdullah's detailed tutorial on using DVC for versioning data, models with a Google Drive</div> <div class="elp-link">https://medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-04-15/ahmed-abdullah-191bb07c64bc8df20e07f777f49e602a.png" alt="Perfect Way of Versioning Models & Training Data"> </div> </a> </section> <p></p> <h2 id="conference-news" style="position:relative;">Conference News<a href="#conference-news" aria-label="conference news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In-person conferences are going on and we are excited to be able to see the Community in person again!</p> <ul> <li><a href="https://twitter.com/GiftOjeabulu_" target="_blank" rel="nofollow noopener noreferrer"><strong>Gift Ojeabulu</strong></a> presented at <a href="https://festival.oscafrica.org/" target="_blank" rel="nofollow noopener noreferrer">Open Source Festival 2022</a> in Lagos, 
Nigeria with the talk: <em>MLOps Exploration with Git & DVC for Machine Learning Project on DAGsHub</em> [<a href="https://speakerdeck.com/giftojabu1/mlops-exploration-with-git-and-dvc-for-machine-learning-project-on-dagshub?slide=2" target="_blank" rel="nofollow noopener noreferrer">slides</a>]</li> <li><a href="https://twitter.com/AntoineToubhans" target="_blank" rel="nofollow noopener noreferrer"><strong>Antoine Toubhans</strong></a> presented <em>Flexible ML Experiment Tracking System for Python Coders with DVC and Streamlit</em> at PyCon Berlin [<a href="https://github.com/sicara/pycon-2022-dvc-streamlit" target="_blank" rel="nofollow noopener noreferrer">repo, slides</a>]</li> <li><a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia Castro</strong></a> presented <em>Making MLOps Uncool Again</em> at PyCon Berlin [<a href="https://github.com/iterative/workshop-uncool-mlops" target="_blank" rel="nofollow noopener noreferrer">repo</a>]</li> <li>Next week at <a href="https://odsc.com/boston/" target="_blank" rel="nofollow noopener noreferrer">ODSC East</a>, come see <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> presenting <em>Model Registry with OpenSource Tools: Git, GitHub, and CI/CD</em>; <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> with <em>Preventing Stale Models in Production</em>; and <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> with <em>Reproducibility, ML Pipelines, and CI/CD in Computer Vision Projects</em> [<a href="https://odsc.com/boston/schedule/" target="_blank" rel="nofollow noopener noreferrer">more info</a>]</li> <li>Visit us at <a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> June 
9-10!</li> </ul> <h2 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="online-course-updates" style="position:relative;">Online Course Updates<a href="#online-course-updates" aria-label="online course updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><img src="https://media.giphy.com/media/EdRgVzb2X3iJW/giphy.gif" alt="Surprised Owl GIF"></p> <p>We've grown from 250 students last month to 450 right now!🎉 We are so happy to see you all learning on the <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">platform</a>! What's coming:</p> <ul> <li>We have heard from some of you that you would like captions. We are working on it!</li> <li>Course guide - you will start to see a course guide for each video, with corresponding resources, explanations, and diagrams for those lessons, and you will be able to take your own notes.</li> </ul> <p>Thank you to all who have provided feedback after each course module! 
We are going through this feedback, making adjustments, and keeping it in mind for the next course!</p> <h3 id="5-new-hires" style="position:relative;">5 New Hires!🎉<a href="#5-new-hires" aria-label="5 new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/dan-martinec-30739a54/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dan Martinec</strong></a> joins us from the Czech Republic as a field data scientist. Dan first learned about Iterative through using DVC in his work as an ML Engineer. Dan originally studied Control Engineering at CTU in Prague. He graduated with a PhD and has worked in various fields (C++ development at Porsche, mathematical optimization in a small start-up, ML engineer at Avast). When not working, Dan enjoys hobby projects in the garden such as building his own storage lodge for firewood, building a wooden composter, implementing a wireless water level reader in the water tank, etc. And after that hard work, he is known to appreciate a good movie. Welcome, Dan!</p> <p><a href="https://www.linkedin.com/in/yury-kasimov-103962b8/" target="_blank" rel="nofollow noopener noreferrer"><strong>Yury Kasimov</strong></a> also joins us from Prague, the Czech Republic as a Field Data Scientist. He studied Robotics during his Bachelor's studies and then Artificial Intelligence for his Master's degree. Yury worked for some time as part of a university group that helps protect NGOs from different cyber attacks. Prior to joining the team, he spent the last 4 years as an ML engineer at Avast. 
In his free time, Yury plays a lot of tennis and is learning to play the drums. He speaks English, Czech, Russian, and a bit of Spanish. Bienvenidos, Yury!</p> <p><a href="https://www.linkedin.com/in/chazblack1/" target="_blank" rel="nofollow noopener noreferrer"><strong>Chaz Black</strong></a> joins us as an Account Executive from Atlanta, Georgia. Most recently he worked at H2O.ai leading their business development team for 3 years. When Chaz is not helping clients, you may find him checking out the ever-growing Atlanta food scene and hunting new and exciting coffees and brewing styles. He is also a big audiophile and like many on our team, Chaz enjoys board and video games when he has the time, with his two cats looking over his shoulder. Welcome, Chaz!</p> <p>Many in our Community already know our latest hire, <a href="https://github.com/dacbd" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Barnes</strong></a>, as he has already been a great contributor to our tools! We are excited to welcome him officially to the team as a Software Engineer. Daniel is based in the Seattle, Washington area, having recently moved back after two years in Korea. He has had a varied career path, starting in IT security, programming, as a medic, then cyber in the US military, and then to PACCAR where he discovered our open-source community! When not solving complex software engineering challenges, Daniel has been noted as a bit of an adrenaline junky with "hobbies" like skydiving, paragliding, and motorcycles. Welcome, Daniel!</p> <p><a href="https://www.linkedin.com/in/maximaginsky/" target="_blank" rel="nofollow noopener noreferrer"><strong>Maxim Aginsky</strong></a> joins the team as a Senior Product Designer from Montreal, Canada, marking our 4th employee from the Province of Quebec! Maxim has worn many hats over the years working on Product Development and most recently was the Director of Design for a Montreal Fintech company. 
You can <a href="https://arrowww.space/" target="_blank" rel="nofollow noopener noreferrer">explore his portfolio here.</a> Welcome, Maxim!</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Even with our amazing new additions to the team, we're still hiring! <a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions and share with anyone you think may be interested! 
🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative is Hiring (<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We've been following along on <a href="https://twitter.com/__anavc__" target="_blank" rel="nofollow noopener noreferrer"><strong>Anna's</strong></a> journey through #100daysofcode to learn DVC. And now she's working on a project of her own using Amazon Best Seller data.</p> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/stale-modelshttps://dvc.org/blog/stale-modelsThu, 31 Mar 2022 00:00:00 GMT<admon type="info"> <p>This post hasn't been updated since its release and the repo is currently broken. Our team is in the process of updating it. Nonetheless, the concepts described still hold true and you should be able to follow along with minor changes.</p> </admon> <p>What happens when the model you've worked so hard to get to production becomes stale? Machine learning engineers and data scientists face this problem all the time. You usually have to figure out where the data drift started so you can determine what input data has changed. Then you need to retrain the model with this new dataset.</p> <p>Retraining could involve a number of experiments across multiple datasets, and it would be helpful to be able to keep track of all of them. 
In this tutorial, we'll walk through how using DVC can help you keep track of those experiments and how this will speed up the time it takes to get new models out to production, preventing stale ones from lingering too long.</p> <h2 id="setting-up-the-project" style="position:relative;">Setting up the project<a href="#setting-up-the-project" aria-label="setting up the project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We'll be working with a project from <a href="https://evidentlyai.com/blog/tutorial-1-model-analytics-in-production" target="_blank" rel="nofollow noopener noreferrer">Evidently.ai</a> that demonstrates what it would be like to work with a production model that experiences data drift over time. We'll take this to the next level by adding some automation with a DVC pipeline and share the results with others using DVC Studio.</p> <p>So we'll start by cloning <a href="https://github.com/iterative/stale-model-example" target="_blank" rel="nofollow noopener noreferrer">this repo for the project</a>. This project is based on the one created by <a href="https://github.com/evidentlyai/evidently/blob/main/examples/data_stories/bicycle_demand_monitoring.ipynb" target="_blank" rel="nofollow noopener noreferrer">evidently.ai</a> with some modifications to work with DVC and different datasets.</p> <p>The reason we're adding DVC and Studio to this project is to automate the way our model evaluation pipeline runs and to version our data as we get new data. We'll be able to share and review the results for each experiment run we do. 
One of the big problems in machine learning is collaboration, so making it easier to share models, data, and results can save your team a lot of time and frustration.</p> <h2 id="set-up-data-drift-reports" style="position:relative;">Set up data drift reports<a href="#set-up-data-drift-reports" aria-label="set up data drift reports permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When the data in production starts to look different from the data that your model was trained on, this is called data drift. There are a number of tools that help monitor for data drift like <a href="https://docs.evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">evidently.ai</a> or <a href="https://docs.aporia.com/" target="_blank" rel="nofollow noopener noreferrer">Aporia</a>.</p> <p>Since we're working with Evidently.ai, you can see the target drift report when you run the notebook for the initial project they made. Here's what it looks like.</p> <p><img src="https://thumb.tildacdn.com/tild6336-3231-4736-b136-646539326135/-/format/webp/4_week3_pred_actual.png" alt="image of the report showing the target drift"></p> <p>So we see at the end of Week 3 the model is in pretty bad shape. 
This is where we can bring in DVC to help us get this stale model off of production faster.</p> <h2 id="running-a-training-experiment-to-get-production-up-to-date" style="position:relative;">Running a training experiment to get production up to date<a href="#running-a-training-experiment-to-get-production-up-to-date" aria-label="running a training experiment to get production up to date permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We'll start by taking a year's worth of data and creating a new model. This might give us a more accurate model to push to production than using weekly data. So we'll take all the data from 2011 (because that's the dataset we have to work with) and make our training and testing datasets. Then we'll check this data into DVC, so it can version it with the following commands:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> data/train.pkl data/test.pkl </span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> data/.gitignore data/train.pkl.dvc data/test.pkl.dvc</span></code></pre></div> <p>We add the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files to Git to ensure that we are only checking in the metadata for the datasets and not the entire dataset files. 
Now we can run the entire MLOps pipeline with <a href="https://dvc.org/doc/command-reference/exp/run" target="_blank" rel="nofollow noopener noreferrer">this command</a>:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div> <p>This will execute the commands we've defined in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> and it will give us the metrics to evaluate how good the model is. Let's take a look at the metrics so far with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"><span class="token rows">┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.seed<span class="token hide">**</span></span></span> ┃ <span class="token 
bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ workspace │ 0.70164 │ 0.51384 │ 20210428 │ 450 │ 64 │ │ main │ 0.60791 │ 0.45758 │ 20210428 │ 375 │ 64 │ │ └── 801fdff [exp-a80c0] │ 0.70164 │ 0.51384 │ 20210428 │ 450 │ 64 │ </span>└─────────────────────────┴──────────┴─────────┴────────────┴─────────────┴─────────────────┘</code></pre></div> <p>This model doesn't have the best metrics, so we can run more experiments to see if tuning hyperparameters will help before we deploy this model to production. Let's change the values of the <code>train.n_est</code> and <code>train.min_split</code> hyperparameters. 
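</p> <p>One way to set up such a sweep, sketched in Python under a few assumptions: the script below only builds and prints one <code>dvc exp run --queue --set-param</code> command per hyperparameter pair, and does not run anything. The grid values are picked from the tables in this post; the helper name <code>build_queue_commands</code> and the two grid constants are hypothetical, and the sketch assumes both parameters live under <code>train</code> in the project's params file.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from itertools import product
import shlex

# Hypothetical search grid; the values are picked from the tables in this post.
N_EST_VALUES = [350, 425, 475]
MIN_SPLIT_VALUES = [28, 64]


def build_queue_commands(n_est_values, min_split_values):
    """Return one `dvc exp run --queue` command (as an argument list) per pair."""
    commands = []
    for n_est, min_split in product(n_est_values, min_split_values):
        commands.append([
            "dvc", "exp", "run", "--queue",
            "--set-param", f"train.n_est={n_est}",
            "--set-param", f"train.min_split={min_split}",
        ])
    return commands


commands = build_queue_commands(N_EST_VALUES, MIN_SPLIT_VALUES)
for command in commands:
    # Print only; nothing is executed here.
    print(shlex.join(command))</code></pre></div> <p>Running the printed commands inside the repo should queue the experiments rather than run them; the queue still has to be executed afterwards, for example with <code>dvc exp run --run-all</code> or <code>dvc queue start</code>, depending on your DVC version.</p> <p>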
We'll <a href="https://dvc.org/doc/user-guide/experiment-management" target="_blank" rel="nofollow noopener noreferrer">run several experiments</a> with different values and it will produce a table similar to this:</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"><span class="token rows">┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.seed<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ workspace │ 0.43501 │ 0.79082 │ 20210428 │ 475 │ 28 │ │ main │ 0.60791 │ 0.45758 │ 20210428 │ 375 │ 64 │ │ ├── 78d29aa [exp-f06bb] │ 0.43501 │ 0.79082 │ 20210428 │ 475 │ 28 │ │ ├── 8fb41cf [exp-1323d] │ 0.42796 │ 0.80841 │ 20210428 │ 425 │ 28 │ │ ├── 434a82f [exp-63459] │ 0.36044 │ 0.87037 │ 20210428 │ 350 │ 28 │ │ 
├── 549586e [exp-ceb6d] │ 0.61998 │ 0.4306 │ 20210428 │ 350 │ 64 │ │ ├── fbf8760 [exp-affe2] │ 0.68824 │ 0.50067 │ 20210428 │ 425 │ 64 │ │ ├── 732ab92 [exp-f8e8d] │ 0.65138 │ 0.49431 │ 20210428 │ 500 │ 64 │ │ └── 801fdff [exp-a80c0] │ 0.70164 │ 0.51384 │ 20210428 │ 450 │ 64 │ </span>└─────────────────────────┴──────────┴─────────┴────────────┴─────────────┴─────────────────┘</code></pre></div> <p>We've run a few experiments with a different hyperparameter value each time and it looks like <code>exp-63459</code> is the best one out of them based on its ROC-AUC value, the highest in the table. So we'll apply this experiment to our workspace and choose this model as the one that will go to production. To apply the experiment, we'll run the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp apply</span> exp-63459</span></code></pre></div> <p>This will update the workspace with the exact code, data, and hyperparameters that were used in that particular experiment. So we can commit these changes to Git and we'll have a reference to everything we need for this exact model. 
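</p> <p>The comparison behind picking an experiment can be sketched in a few lines of Python. This is only an illustration: the metric values are copied by hand from the table above, and the <code>experiments</code> dictionary and the <code>best_by</code> helper are hypothetical names, not part of DVC.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Metric values copied by hand from the experiment table above.
experiments = {
    "exp-f06bb": {"avg_prec": 0.43501, "roc_auc": 0.79082},
    "exp-1323d": {"avg_prec": 0.42796, "roc_auc": 0.80841},
    "exp-63459": {"avg_prec": 0.36044, "roc_auc": 0.87037},
    "exp-ceb6d": {"avg_prec": 0.61998, "roc_auc": 0.4306},
    "exp-affe2": {"avg_prec": 0.68824, "roc_auc": 0.50067},
    "exp-f8e8d": {"avg_prec": 0.65138, "roc_auc": 0.49431},
    "exp-a80c0": {"avg_prec": 0.70164, "roc_auc": 0.51384},
}


def best_by(metric):
    """Return the name of the experiment with the highest value for `metric`."""
    return max(experiments, key=lambda name: experiments[name][metric])


best_roc_auc = best_by("roc_auc")
best_avg_prec = best_by("avg_prec")
print("best roc_auc:", best_roc_auc)
print("best avg_prec:", best_avg_prec)</code></pre></div> <p>The two metrics do not have to agree on a winner, so it helps to decide which one drives the choice before applying an experiment.</p> <p>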
Now let's say we have deployed this to production and it's been a great model for almost another year, then we start noticing data drift again.</p> <h2 id="running-more-training-experiments-with-new-data" style="position:relative;">Running more training experiments with new data<a href="#running-more-training-experiments-with-new-data" aria-label="running more training experiments with new data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>That means it's time to update our dataset with the latest data from production and that will include all the data on bike sharing in 2012 (because this is the newer data we have to train with). DVC will note the changes in the data and create a new version record for the updated data automatically.</p> <p>Next we'll run a new experiment in the project with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div> <p>Then we can take a look at the metrics with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span></span></code></pre></div> <p>Since we cleared our workspace by pushing the changes to Git, we'll have a fresh table to look at. 
Now you should see a table similar to this:</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"><span class="token rows">┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> ┃ <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.seed<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> ┃ <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ workspace │ 0.42526 │ 0.82722 │ 20210428 │ 400 │ 28 │ │ main │ 0.69744 │ 0.63056 │ 20210428 │ 475 │ 32 │ │ ├── e76a89d [exp-7d207] │ 0.42526 │ 0.82722 │ 20210428 │ 400 │ 28 │ │ ├── 2a6d647 [exp-7526d] │ 0.74411 │ 0.65808 │ 20210428 │ 400 │ 32 │ │ ├── 467fd3d [exp-dfabd] │ 0.71431 │ 0.6267 │ 20210428 │ 450 │ 32 │ │ ├── 2a2171c [exp-45493] │ 0.58291 │ 0.49201 │ 20210428 │ 350 │ 48 │ │ └── 683dc49 [exp-2649a] │ 0.58421 │ 0.5783 │ 20210428 │ 475 │ 48 │ 
</span>└─────────────────────────┴──────────┴─────────┴────────────┴─────────────┴─────────────────┘</code></pre></div> <p>Having the updated dataset made a huge difference in the metrics, and it looks like this model has a different set of hyperparameters that perform well. Now that we have all of the experiments with both the old and new datasets, this is a good time to share the results with your coworkers and get some feedback.</p> <h2 id="viewing-experiment-results-in-dvc-studio" style="position:relative;">Viewing experiment results in DVC Studio<a href="#viewing-experiment-results-in-dvc-studio" aria-label="viewing experiment results in dvc studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Because we already have DVC set up in this project, we can run as many experiments as we need to, and it will track which datasets we're working with, the code changes that we make, and it'll let us look at all the results from each experiment in Studio.</p> <p>If you go to <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">Iterative Studio</a>, you'll be prompted to connect to your GitHub/GitLab account and you'll be able to choose the repo for this project. 
Once you're connected, you should be able to see all the experiments you've pushed to your Git history.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/928412502573b85d1c9da6ef8b136d4c/39600/stale_models_in_studio.png" alt="example of plots and results in DVC Studio" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>You can give others on your team access to this, and they'll be able to run new experiments and see the results right in the browser. This is a great tool to use to discuss the next best steps in your model training before you're ready to deploy.</p> <h2 id="deploy-new-model-to-production" style="position:relative;">Deploy new model to production<a href="#deploy-new-model-to-production" aria-label="deploy new model to production permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The output of our training stage is the <code>model.pt</code> file. Now all we need to do is get this to our production environment. That could be a web API that returns results in real time, or you could do some kind of batch prediction. 
Regardless of how you deploy to production, you now have a model that's been updated to account for the previous data drift.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now you just have to keep an eye on this new model to make sure that it doesn't stray too far from the results you expect. This is one of the processes you can use to keep your production models from going stale. You could even automate some parts of this process if you know what your thresholds are!</p>https://dvc.org/blog/march-22-community-gemshttps://dvc.org/blog/march-22-community-gemsWed, 30 Mar 2022 00:00:00 GMT<h3 id="what-is-the-difference-between-using-dvc-exp-run-and-dvc-repro" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/939070512322195456" target="_blank" rel="nofollow noopener noreferrer">What is the difference between using <code>dvc exp run</code> and <code>dvc repro</code>?</a><a href="#what-is-the-difference-between-using-dvc-exp-run-and-dvc-repro" aria-label="what is the difference between using dvc exp run and dvc repro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 
11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a really good question from @v2.03.99!</p> <p>When you use <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>, DVC automatically tracks each experiment run. Using <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> leaves it to the user to track each experiment.</p> <p>You can learn how <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> uses custom Git refs to track experiments in this <a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">blog post</a> and you can see a quick technical overview in <a href="https://dvc.org/doc/user-guide/experiment-management/experiments-overview" target="_blank" rel="nofollow noopener noreferrer">the docs here</a>.</p> <h3 id="what-is-a-good-way-to-debug-dvc-stages-in-vscode" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/939269709780643861" target="_blank" rel="nofollow noopener noreferrer">What is a good way to debug DVC stages in VSCode?</a><a href="#what-is-a-good-way-to-debug-dvc-stages-in-vscode" aria-label="what is a good way to debug dvc stages in vscode permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>A great question here from @quarkquark!</p> <p>You can debug in VSCode by following the steps below:</p> <ul> <li>Install the <code>debugpy</code> package.</li> <li>Navigate to <code>"Run and Debug" > "Remote Attach" > localhost > 
someport</code>.</li> <li>In a terminal in VSCode, <code>python -m debugpy --listen someport --wait-for-client -m dvc mycommand</code></li> </ul> <p>This should help you debug the stages in your pipeline in the IDE and you can find <a href="https://github.com/iterative/dvc/wiki/Debugging-DVC-interactively" target="_blank" rel="nofollow noopener noreferrer">more details here</a>.</p> <h3 id="is-there-a-way-to-list-what-files-and-ideally-additional-info-like-location-md5-etc-are-within-a-directory-tracked-by-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/940318136568258650" target="_blank" rel="nofollow noopener noreferrer">Is there a way to list what files (and ideally additional info like location, MD5, etc) are within a directory tracked by DVC?</a><a href="#is-there-a-way-to-list-what-files-and-ideally-additional-info-like-location-md5-etc-are-within-a-directory-tracked-by-dvc" aria-label="is there a way to list what files and ideally additional info like location md5 etc are within a directory tracked by dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for asking @CarsonM!</p> <p>You should be able to use DVC to list the directory contents of your DVC remotes without pulling the repo. 
Here's an example of the command you can run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc list</span> https://github.com/iterative/dataset-registry/ fashion-mnist/raw</span></code></pre></div> <h3 id="if-we-have-multiple-datasets-is-it-recommended-to-have-1-remote-per-dataset-or-to-have-1-remote-and-let-dvc-handle-the-paths" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/943213340195434546" target="_blank" rel="nofollow noopener noreferrer">If we have multiple datasets, is it recommended to have 1 remote per dataset or to have 1 remote and let DVC handle the paths?</a><a href="#if-we-have-multiple-datasets-is-it-recommended-to-have-1-remote-per-dataset-or-to-have-1-remote-and-let-dvc-handle-the-paths" aria-label="if we have multiple datasets is it recommended to have 1 remote per dataset or to have 1 remote and let dvc handle the paths permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a really interesting question from @BrownZ!</p> <p>It really depends on your use case. 
Separate remotes might be useful if you want to have granular control over permissions for each dataset.</p> <p>In general, we would suggest using a single remote and setting up a <a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">data registry</a> to handle the different datasets through DVC.</p> <h3 id="is-there-a-mailing-list-for-subscribing-to-cml-releases" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/939215540591927337" target="_blank" rel="nofollow noopener noreferrer">Is there a mailing list for subscribing to CML releases?</a><a href="#is-there-a-mailing-list-for-subscribing-to-cml-releases" aria-label="is there a mailing list for subscribing to cml releases permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>It's awesome that community members like @pria want to keep up with our releases!</p> <p>You can follow all of our releases via GitHub notifications. You can browse release notes at <a href="https://github.com/iterative/cml/releases" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/cml/releases</a>. 
You can also subscribe to release updates by clicking the <code>Watch</code> button in the top-right, navigating to <code>Custom</code>, and checking the <code>Releases</code> option.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 166px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/eb18c4360f0c57b120be336596dc0a9d/ca0b1/cml-release-follow.png" alt="the checkbox you need to check in GitHub to follow CML releases" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="does-cml-send-comment-work-for-azure-devops-repositories" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/947986936994353293" target="_blank" rel="nofollow noopener noreferrer">Does <code>cml-send-comment</code> work for azure devops repositories?</a><a href="#does-cml-send-comment-work-for-azure-devops-repositories" aria-label="does cml send comment work for azure devops repositories permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the question @1cybersheep1!</p> <p>Currently, the supported Source Code Management tools are GitHub, GitLab, and Bitbucket. 
Other SCMs may be a part of the roadmap later on.</p> <h3 id="if-my-model-is-training-on-a-self-hosted-local-runner-and-i-already-have-a-shared-dvc-cache-set-up-on-the-same-machine-is-there-a-good-way-for-my-github-workflow-to-access-that-cache-instead-of-having-to-redownload-it-all-from-cloud-storage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/951240652035883008" target="_blank" rel="nofollow noopener noreferrer">If my model is training on a self-hosted, local runner, and I already have a shared DVC cache set up on the same machine, is there a good way for my GitHub workflow to access that cache instead of having to redownload it all from cloud storage?</a><a href="#if-my-model-is-training-on-a-self-hosted-local-runner-and-i-already-have-a-shared-dvc-cache-set-up-on-the-same-machine-is-there-a-good-way-for-my-github-workflow-to-access-that-cache-instead-of-having-to-redownload-it-all-from-cloud-storage" aria-label="if my model is training on a self hosted local runner and i already have a shared dvc cache set up on the same machine is there a good way for my github workflow to access that cache instead of having to redownload it all from cloud storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Excellent question from @luke_imm!</p> <p>In GitHub, you can mount volumes to your container, but you have to declare them within the <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-running-a-job-within-a-container" target="_blank" 
rel="nofollow noopener noreferrer">workflow YAML</a>.</p> <hr> <p><img src="https://media.giphy.com/media/3o6Mbnll2gudglC3HG/giphy.gif" alt="Season 3 Race GIF"></p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/march-22-heartbeathttps://dvc.org/blog/march-22-heartbeatThu, 17 Mar 2022 00:00:00 GMT<h1 id="on-the-war-in-ukraine-" style="position:relative;">On the war in Ukraine 🇺🇦<a href="#on-the-war-in-ukraine-" aria-label="on the war in ukraine permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>While the war in Ukraine has impacted the world, it has also greatly impacted our company as we have team members living in Ukraine and Russia, and many with family ties to both. Our hearts are with our Iterative family in Ukraine and we are committed to doing everything we can to support the safety of our Ukrainian colleagues, as well as the transition of our Russian colleagues, during this crisis.</p> <p>We as a company are against this war. We have donated to the humanitarian efforts to help the people of Ukraine and are matching our team members' donations as well. We are proud of the perseverance, care, and support coming from our team at this time.</p> <p>If you are able, we ask that you consider these resources as ways to help. Our hope is that the world will find a quick and peaceful end to this war and Ukraine will be restored, even stronger than before. 
🇺🇦</p> <h2 id="donations" style="position:relative;">🪙 Donations<a href="#donations" aria-label="donations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a href="https://www.reddit.com/r/ukraine/comments/s6g5un/want_to_support_ukraine_heres_a_list_of_charities/" target="_blank" rel="nofollow noopener noreferrer">A list of charities with direct connections to Ukrainian people endorsed</a> by the <a href="https://kyivindependent.com/" target="_blank" rel="nofollow noopener noreferrer">Kyiv Independent</a>. Everything on this list except for the "Charities that help the war effort” section is for humanitarian efforts only.</li> <li><a href="https://bank.gov.ua/en/news/all/natsionalniy-bank-vidkriv-rahunok-dlya-gumanitarnoyi-dopomogi-ukrayintsyam-postrajdalim-vid-rosiyskoyi-agresiyi" target="_blank" rel="nofollow noopener noreferrer">Humanitarian Assistance to Ukrainians by National Bank of Ukraine</a></li> <li><a href="https://www.unicefusa.org/?form=ukraine-emergency-match" target="_blank" rel="nofollow noopener noreferrer">UNICEF USA</a> (2x additional match)</li> <li><a href="https://www.unicef.org.uk/donate/donate-now-to-protect-children-in-ukraine/" target="_blank" rel="nofollow noopener noreferrer">UNICEF</a> UK</li> <li><a href="https://donate.unrefugees.org.uk/general/~my-donation?_cv=1" target="_blank" rel="nofollow noopener noreferrer">UNHCR</a></li> <li><a href="https://donate.redcrossredcrescent.org/ua/donate/~my-donation?_cv=1" target="_blank" rel="nofollow noopener noreferrer">RedCross Ukraine</a> (there are some concerns about this 
org - see <a href="https://twitter.com/ptico/status/1502192685364531204" target="_blank" rel="nofollow noopener noreferrer">one</a>, <a href="https://twitter.com/KyivIndependent/status/1501136976447168512" target="_blank" rel="nofollow noopener noreferrer">two</a>)</li> <li><a href="https://donate.redcross.org.uk/appeal/ukraine-crisis-appeal" target="_blank" rel="nofollow noopener noreferrer">RedCross UK</a></li> <li><a href="https://give.internationalmedicalcorps.org/page/99837/donate/1" target="_blank" rel="nofollow noopener noreferrer">International Medical Corps</a></li> <li><a href="https://www.wfp.org/support-us/stories/ukraine-appeal" target="_blank" rel="nofollow noopener noreferrer">WFP</a></li> <li><a href="https://www.ukrainecharity.org/war-crisis-692518.html" target="_blank" rel="nofollow noopener noreferrer">UKRAINECHARITY</a></li> <li><a href="https://novaukraine.org/" target="_blank" rel="nofollow noopener noreferrer">NOVA UKRAINE</a></li> <li><a href="https://www.gofundme.com/f/support-ukrainian-refugees-arriving-in-poland" target="_blank" rel="nofollow noopener noreferrer">GOFUNDME / Support Ukrainian Refugees Arriving In Poland</a></li> <li><a href="https://www.doctorswithoutborders.org/what-we-do/countries/ukraine" target="_blank" rel="nofollow noopener noreferrer">Doctors Without Borders</a></li> <li><a href="https://support.savethechildren.org/site/Donation2?df_id=5751&mfc_pref=T&5751.donation=form1" target="_blank" rel="nofollow noopener noreferrer">Save the Children</a></li> <li><a href="https://www.icrc.org/en/donate/ukraine" target="_blank" rel="nofollow noopener noreferrer">ICRC</a></li> <li><a href="https://secure.projecthope.org/site/SPageNavigator/2022_02_Ukraine_Response_Web_UNR.html&s_subsrc=oth" target="_blank" rel="nofollow noopener noreferrer">Project Hope</a></li> <li><a href="https://www.flexport.org/donate-now" target="_blank" rel="nofollow noopener noreferrer">Flexport</a></li> </ul> <h2 id="️other-ways-to-help" 
style="position:relative;">❤️‍🩹 Other ways to help<a href="#%EF%B8%8Fother-ways-to-help" aria-label="️other ways to help permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li><a href="https://icanhelp.host/" target="_blank" rel="nofollow noopener noreferrer">I Can Help (hosting)</a></li> <li><a href="https://www.airbnb.org/help-ukraine" target="_blank" rel="nofollow noopener noreferrer">Airbnb - host a refugee</a></li> </ul> <hr> <h1 id="aiml-news" style="position:relative;">AI/ML News<a href="#aiml-news" aria-label="aiml news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/5YiRHZtcSeiEyOpSV7/giphy.gif" alt="Excited Marie Kondo GIF"></p> <h2 id="mihail-eric-mlops-is-a-mess-but-thats-to-be-expected" style="position:relative;">Mihail Eric: MLOps is a Mess But That's to be Expected<a href="#mihail-eric-mlops-is-a-mess-but-thats-to-be-expected" aria-label="mihail eric mlops is a mess but thats to be expected permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/mihail_eric" target="_blank" rel="nofollow noopener noreferrer"><strong>Mihail Eric</strong></a> writes a long but <em>really worth it</em> piece entitled <a href="https://www.mihaileric.com/posts/mlops-is-a-mess/" target="_blank" rel="nofollow noopener noreferrer">MLOps is a Mess But That’s to be Expected.</a> In it he discusses the allure of pursuing a machine learning career, only to run smack into the giant wall of learning that encompasses the space, not the least of which is the multitude of tools to pick through once you get there. He reviews the state of machine learning and adds some history of DevOps for perspective on MLOps.<br> You will find advice for newcomers and some final, thorough thoughts and predictions, especially as they relate to “ML at a reasonable scale” companies.<br> Definitely worth your review!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d7d7813635625b97f263ea955bb7f77c/39600/hype-cycle-mihail-eric.png" alt="Gartner Hype cycle for MLOps" title="Gartner Hype cycle for MLOps" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Gartner Hype cycle for MLOps (<a href="https://www.mihaileric.com/posts/mlops-is-a-mess/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h1 id="community-news" style="position:relative;">Community News<a href="#community-news" aria-label="community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 
0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="kevin-lu-learn-how-to-use-data-version-control-to-remove-the-third-wheel-from-your-relationship" style="position:relative;">Kevin Lu: Learn how to use Data Version Control to remove the third wheel from your relationship<a href="#kevin-lu-learn-how-to-use-data-version-control-to-remove-the-third-wheel-from-your-relationship" aria-label="kevin lu learn how to use data version control to remove the third wheel from your relationship permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8394d1b2ac723900cba241f2483b5cf5/ab158/kevin.png" alt="Learn how to use Data Version Control to remove the third wheel from your relationship" title="Learn how to use Data Version Control to remove the third wheel from your relationship" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> In <a href="https://medium.com/@kevinylu/learn-how-to-use-data-version-control-to-remove-the-third-wheel-from-your-relationship-ce4c2afa649c" target="_blank" rel="nofollow noopener noreferrer">this hilarious post,</a> <a 
href="https://medium.com/@kevinylu" target="_blank" rel="nofollow noopener noreferrer"><strong>Kevin Lu</strong></a> teaches us how to use DVC to disconnect from our unhealthy, addictive relationships with our computers and make room for more human relationships! You don't want to miss the humor, productivity and wisdom here, all of which help you understand how each of DVC's commands helps your machine learning engineering exploits.</p> <h2 id="thanakorn-panyapiang-putting-a-machine-learning-model-into-production-with-google-cloud-platform-and-dvc" style="position:relative;">Thanakorn Panyapiang: Putting A Machine Learning model into production with Google Cloud Platform and DVC<a href="#thanakorn-panyapiang-putting-a-machine-learning-model-into-production-with-google-cloud-platform-and-dvc" aria-label="thanakorn panyapiang putting a machine learning model into production with google cloud platform and dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Are you a data scientist new to putting models into production?<br> <a href="https://towardsdatascience.com/putting-machine-learning-model-into-production-with-google-cloud-platform-and-dvc-f6a22cdcf4a5" target="_blank" rel="nofollow noopener noreferrer">In this piece</a> <a href="https://www.linkedin.com/in/tpanyapiang/" target="_blank" rel="nofollow noopener noreferrer"><strong>Thanakorn Panyapiang</strong></a> describes various model deployment strategies to put projects into production, including model-as-service, batch prediction and model-on-edge. 
In his example he uses a batch prediction approach with an image segmentation model to identify clouds. He uses DVC as a model registry with Google Cloud storage and GitHub actions to automate the Cloud Functions deployment. See all the steps he outlines in his piece to get real value out of your machine learning projects.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8fb6217eabb6588ca79eeb4ebe471cd1/03346/panyapiang.jpg" alt="Data Pipeline" title="Data Pipeline" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Data Pipeline (<a href="https://towardsdatascience.com/putting-machine-learning-model-into-production-with-google-cloud-platform-and-dvc-f6a22cdcf4a5" target="_blank" rel="nofollow noopener noreferrer">Source link: Author</a>)</em></p> <h2 id="matthew-upson-mlops-for-conversational-ai-with-rasa-dvc-and-cml-partii" style="position:relative;">Matthew Upson: MLOps for Conversational AI with Rasa, DVC, and CML (PartII)<a href="#matthew-upson-mlops-for-conversational-ai-with-rasa-dvc-and-cml-partii" aria-label="matthew upson mlops for conversational ai with rasa dvc and cml partii permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In the <a href="https://dvc.org/blog/december-21-heartbeat" target="_blank" rel="nofollow noopener noreferrer">December Heartbeat,</a> I told you about <a href="https://twitter.com/m_a_upson" target="_blank" rel="nofollow 
noopener noreferrer"><strong>Matt Upson's</strong></a> first post in his series on using DVC, CML and Rasa together. <a href="https://medium.com/mantisnlp/mlops-for-conversational-ai-with-rasa-dvc-and-cml-part-ii-3a70fe2f357d" target="_blank" rel="nofollow noopener noreferrer">In this second post</a> he goes through some Rasa basics and gets the DVC pipeline set up, with its train and test stages, params, dependencies, outs and metrics. He also covers syncing with DVC, making changes, the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command, the <code>dvc.lock</code> file, and pushing to remote storage. We're looking forward to the next installment, when we will see how CML can be used to automatically train the model.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d5f1cb82c955a6ecc1ddb237ed888689/39600/upson.png" alt="Rasa DVC metrics diff" title="Rasa DVC metrics diff" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC metrics diff in Rasa project (<a href="https://medium.com/mantisnlp/mlops-for-conversational-ai-with-rasa-dvc-and-cml-part-ii-3a70fe2f357d" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="sibanjan-das-mlops-for-enterprise-ai" style="position:relative;">Sibanjan Das: MLOps for Enterprise AI<a href="#sibanjan-das-mlops-for-enterprise-ai" aria-label="sibanjan das mlops for enterprise ai permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 
9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/sibanjan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Sibanjan Das</strong></a> notes the trending of the MLOps keyword in <a href="https://dzone.com/articles/mlops-for-enterprise-ai" target="_blank" rel="nofollow noopener noreferrer">his piece</a> in <a href="https://dzone.com" target="_blank" rel="nofollow noopener noreferrer">DZone.</a> Sibanjan gives an overview of MLOps and how it supports the AI/ML ecosystem to deliver return on investment for ML projects. He reviews the components of MLOps, including automated ML model building pipelines, model serving, model version control, model/data monitoring, and security and governance. He also discusses the MLOps maturity models of Google and Microsoft (see below). I found this part especially interesting as it mirrors what we see in our Community and how they develop using our tools as well. Finally, he outlines some tools that help in the process, including DVC.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/69402d709c22ad8d698a6d67b39a2bf4/03346/das.jpg" alt="Comparing Google's and Microsoft's maturity models" title="Comparing Google's and Microsoft's maturity models" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Comparing Google's and Microsoft's maturity models (<a href="https://dzone.com/articles/mlops-for-enterprise-ai" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="jagreet-kaur-implementing-devops-for-machine-learning---a-quick-guide" style="position:relative;">Jagreet Kaur: Implementing DevOps for Machine Learning - A Quick Guide<a href="#jagreet-kaur-implementing-devops-for-machine-learning---a-quick-guide" aria-label="jagreet kaur implementing devops 
for machine learning a quick guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cee3733976e3c8af79f13a22284e6f55/39600/jagreet-kaur.png" alt="Tensorflow, PyTorch, DVC, Docker, CI/CD" title="Continuous Development Life Cycle Guide from Xenostack =" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="https://www.linkedin.com/in/jagreetkaur/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jagreet Kaur</strong></a> of <a href="https://www.xenonstack.com/" target="_blank" rel="nofollow noopener noreferrer">Xenonstack</a> authors <a href="https://www.xenonstack.com/blog/devops-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">a guide</a> on applying DevOps to machine learning and, more generally, on what the continuous development life cycle looks like for machine learning projects. Jagreet goes over all the fun continuous topics, including continuous integration, continuous testing, continuous retraining, and continuous deployment. She gives an overview of the use of TensorFlow, PyTorch, and Docker, as well as DVC for version control, experiment management, deployment, and collaboration. 
Additional resources from Xenonstack are provided for further review.</p> <h3 id="yuqi-li-why-mlops-should-be-open-source" style="position:relative;">Yuqi Li: Why MLOps should be Open Source<a href="#yuqi-li-why-mlops-should-be-open-source" aria-label="yuqi li why mlops should be open source permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/270790f920a14247fcc2e0ea0e2f80e6/03346/yuqi-li.jpg" alt="Why MLOps Tools should be Open Source" title="Why MLOps Tools should be Open Source =" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <a href="https://www.linkedin.com/in/yuqiliofficial/" target="_blank" rel="nofollow noopener noreferrer"><strong>Yuqi Li</strong></a>, <a href="https://towardsdatascience.com/why-mlops-tools-should-be-open-source-5ad696463f54" target="_blank" rel="nofollow noopener noreferrer">in this opinion piece</a> in <a href="https://towardsdatascience.com/" target="_blank" rel="nofollow noopener noreferrer">Towards Data Science</a>, gives an overview of the meaning and components of MLOps and identifies a number of good open-source tools in the space, which of course include DVC. He also outlines a number of reasons why MLOps should be open source. 
Among the reasons making the cut:</p> <ol> <li>Cost-Effectiveness</li> <li>Ownership</li> <li>No privacy concern</li> <li>Build Community around the tool</li> </ol> <p>Examine these reasons to determine if open source makes sense for your MLOps work. We think you will find that it does.</p> <h2 id="and-speaking-of-community" style="position:relative;">And speaking of Community…<a href="#and-speaking-of-community" aria-label="and speaking of community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h2 id="mert-bozkir-community-driven-learning" style="position:relative;">Mert Bozkir: Community-Driven Learning<a href="#mert-bozkir-community-driven-learning" aria-label="mert bozkir community driven learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you’ve been in our Discord server, been to one of our Meetups, or interacted with us on Twitter, you’ve surely come across DVC Community All-Star <a href="https://github.com/mertbozkir" target="_blank" rel="nofollow noopener noreferrer"><strong>Mert Bozkir</strong></a>. 
Mert has written <a href="https://medium.com/@mertbozkir/community-driven-learning-2481103aa190" target="_blank" rel="nofollow noopener noreferrer">a great piece</a> entitled <em>Community Driven Learning</em> that describes why it is the best way to learn. He outlines his reasoning, including the support, encouragement, and motivation you can get from the Community to be persistent in your learning efforts. He also lists eight communities that are great for learning, with invites included. Be sure to check it out!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6ec351cad71685ffed5d067d74c5ac38/03346/community.jpg" alt="Community Driven Learning" title="Community Driven Learning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Community Driven Learning (<a href="https://unsplash.com/@john_cameron" target="_blank" rel="nofollow noopener noreferrer">Source link: Unsplash by john_cameron</a>)</em></p> <h2 id="and-speaking-of-learning" style="position:relative;">And speaking of learning…<a href="#and-speaking-of-learning" aria-label="and speaking of learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 
0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/3ohuAxV0DfcLTxVh6w/giphy.gif" alt="GIF by Star Wars"></p> <h2 id="online-courses-updates" style="position:relative;">Online Course(s) Updates<a href="#online-courses-updates" aria-label="online courses updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li> <p>We now have over <strong>250</strong> students taking the course and <strong>10</strong> students that have completed the course! 🎉 Thank you to all who have given us feedback. We are actively working on making adjustments to the course and improving the next one.</p> </li> <li> <p>We have a new look! The website for our online course, Iterative Tools for Data Scientists and Analysts has been updated to be more streamlined to more clearly identify what our students need in the course!</p> </li> <li> <p>We have already begun working on the second course which will be more advanced (remember those maturity models outlined in the article from DZone above?) and will cover scenarios with CML. We are also working on creating an ebook for each video that will provide relevant information, diagrams, and links with the video content instead of being batched at the end of the module. 
The ebook format will also let you take your own notes as you study!</p> </li> </ul> <h2 id="new-hires" style="position:relative;">New Hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><img src="https://media.giphy.com/media/lQ0LC603dA96Gs2Hfx/giphy.gif" alt="My Team GIF by The Voice"></p> <p><a href="https://www.linkedin.com/in/michael-moynihan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mike Moynihan</strong></a> joins us from Brooklyn, NY as an Account Executive. He previously worked at Code Climate as the Manager of Business Development and an Account Executive. Mike's really into biking and will be participating in the 5-Boro Bike Tour in NYC this year. He's also a baker and has been baking bread and other baked goods consistently for about 3 years now. Finally, when not working or biking or baking, you may find him playing one of the video or board games in his 500-strong collection.</p> <p><a href="https://www.linkedin.com/in/rcdewit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob De Wit</strong></a> joins our team from Utrecht, the Netherlands as a Developer Advocate. Rob's first focus will be on developing those new ebooks for our new online courses mentioned above. He has a background in Information Sciences and previously worked at bol.com and Devoteam. 
When not working, Rob likes photo and video editing, board games, organizing meetups, and hiking (the Peaks of the Balkans are on his bucket list).<br> He also stays busy by learning Spanish and dabbling in local politics.</p> <h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="march-office-hours" style="position:relative;">March Office Hours!<a href="#march-office-hours" aria-label="march office hours permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Be sure to join us at the <a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" target="_blank" rel="nofollow noopener noreferrer">March Office Hours Meetup,</a> where <a href="https://github.com/PythonFZ/" target="_blank" rel="nofollow noopener noreferrer"><strong>Fabian Zills</strong>,</a> PhD student at <a href="https://www.uni-stuttgart.de/en/" target="_blank" rel="nofollow noopener noreferrer">University of Stuttgart,</a> will present his ZnTrack ("zinc track") project which creates, runs and benchmarks DVC pipelines 
in Python and Jupyter Notebooks.<br> <a href="https://github.com/zincware/ZnTrack" target="_blank" rel="nofollow noopener noreferrer">Find the repo here!</a></p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">March Office Hours - ZnTrack</h4> <div class="elp-description">RSVP for DVC Office Hours - ZnTrack - Create, Visualize, Run and Benchmark DVC Pipelines in Python & Jupyter Notebooks </div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-03-17/office-hours-meetup-dcb241606953b111ec130fa158c4527b.png" alt="March Office Hours - ZnTrack"> </div> </a> </section> <p></p> <h2 id="conferenceshackathons" style="position:relative;">Conferences/Hackathons<a href="#conferenceshackathons" aria-label="conferenceshackathons permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ul> <li>We will be sponsoring <a href="https://odsc.com/boston/" target="_blank" rel="nofollow noopener noreferrer">ODSC East</a> and <a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a> this year, so if you are attending, we'd love to meet you IRL! 
Stop by our booth!</li> <li><a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> will be speaking at <a href="https://2022.pythonwebconf.com/" target="_blank" rel="nofollow noopener noreferrer">PythonWeb Conference</a> on March 22nd on "Using Reproducible Experiments to Create Better Machine Learning Models."</li> <li><a href="https://github.com/daavoo" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia Castro</strong></a> will be presenting his workshop "Making MLOps Uncool Again" at <a href="https://mlopsworld.com/newyork/" target="_blank" rel="nofollow noopener noreferrer">MLOps World New York</a> on March 29th and at <a href="https://2022.pycon.de/" target="_blank" rel="nofollow noopener noreferrer">PyCon Berlin</a> on April 11th.</li> <li>Community member <a href="https://twitter.com/GiftOjeabulu_" target="_blank" rel="nofollow noopener noreferrer"><strong>Gift Ojeabulu</strong></a> will be giving a talk on "MLops Exploration with Git and DVC for Machine Learning Project" at <a href="https://festival.oscafrica.org/" target="_blank" rel="nofollow noopener noreferrer">Open Source Festival 2022</a>, March 24-26.</li> <li><a href="https://www.battery.dev/" target="_blank" rel="nofollow noopener noreferrer">BatteryDev Hackathon</a> will take place next week, and <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> will hold an Office Hours session on March 21st for those needing help with DVC.</li> <li><a href="https://twitter.com/AntoineToubhans" target="_blank" rel="nofollow noopener noreferrer"><strong>Antoine Toubhans</strong></a> will be presenting his DVC integration with Streamlit at <a href="https://2022.pycon.de/" target="_blank" rel="nofollow noopener noreferrer">PyCon Berlin</a> as well.</li> </ul> <h2 id="-new-docs" style="position:relative;">📖 New Docs<a href="#-new-docs" aria-label=" 
new docs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="cml-ci" style="position:relative;">CML CI<a href="#cml-ci" aria-label="cml ci permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>CML has a new command line reference that lets you prepare the Git repository for CML operations. For more info on <code>cml ci</code>, <a href="https://cml.dev/doc/ref/ci#command-reference-ci" target="_blank" rel="nofollow noopener noreferrer">check out the docs</a></p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Even with our amazing new additions to the team, we're still hiring! 
<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions and share with anyone you think may be interested! 🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative is Hiring (<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We were really excited to see the <a href="https://www.sicara.ai/" target="_blank" rel="nofollow noopener noreferrer">Sicara</a> team all decked out in their DVC swag this month in this Tweet. 
If you haven't seen the video of <a href="https://twitter.com/AntoineToubhans" target="_blank" rel="nofollow noopener noreferrer">Antoine Toubhans</a>'s integration with Streamlit, you can <a href="https://www.youtube.com/watch?v=F318uN01v7M&t=2s" target="_blank" rel="nofollow noopener noreferrer">see it on our YouTube channel</a> or catch the presentation at this year's <a href="https://2022.pycon.de/" target="_blank" rel="nofollow noopener noreferrer">PyCon Berlin.</a></p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Swag received :) Thanks <a href="https://twitter.com/DVCorg">@DVCorg</a> !! We love DVC at @sicara_fr 👉 keep up the great work 👍 <a href="https://twitter.com/_Okamille">@_Okamille</a> <a href="https://twitter.com/e_vignon">@e_vignon</a> <a href="https://twitter.com/cpierrehenri">@cpierrehenri</a> <a href="https://twitter.com/JPro20">@JPro20</a> <a href="https://twitter.com/SoulMathieu">@SoulMathieu</a> <a href="https://twitter.com/Arnault_Chaz">@Arnault_Chaz</a> <a href="https://t.co/RbFuCMG4NS">pic.twitter.com/RbFuCMG4NS</a></p>— Antoine Toubhans (@AntoineToubhans) <a href="https://twitter.com/AntoineToubhans/status/1497254983963660292">February 25, 2022</a></blockquote> <p>How do you get some DVC swag, you ask? Write us some great content, contribute to our tools, or give a presentation at one of our Meetups! We'd love to have you!</p> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/february-22-community-gemshttps://dvc.org/blog/february-22-community-gemsMon, 28 Feb 2022 00:00:00 GMT<h3 id="how-can-i-delete-dvc-tracked-files-from-cloud-storage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/927618225989111880" target="_blank" rel="nofollow noopener noreferrer">How can I delete DVC-tracked files from cloud storage?</a><a href="#how-can-i-delete-dvc-tracked-files-from-cloud-storage" aria-label="how can i delete dvc tracked files from cloud storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the question @fireballpoint1!</p> <p>You can find the best way to delete files from your cloud storage in <a href="https://dvc.org/doc/command-reference/gc#removing-data-in-remote-storage" target="_blank" rel="nofollow noopener noreferrer">our docs</a>. Make sure you're super careful when deleting data from the cloud because it's an irreversible action. 
Here's an example of a deletion command that will clear out everything in your cloud storage <em>except</em> what is referenced in your workspace:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc gc</span> <span class="token parameter variable">--workspace</span> <span class="token parameter variable">--cloud</span></span></code></pre></div> <p>This option keeps only the files and directories referenced in the workspace, and it removes everything else, including data in the cloud and cache. By default, this command will use the default remote you have set. You can specify a different remote storage with the <code>--remote</code> option like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc gc</span> <span class="token parameter variable">--workspace</span> <span class="token parameter variable">--cloud</span> <span class="token parameter variable">--remote</span> name_of_remote</span></code></pre></div> <h3 id="im-using-dvc-experiments-but-the-git-index-gets-corrupted-with-large-4gb-files-what-is-the-best-workaround" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/928939232033140736" target="_blank" rel="nofollow noopener noreferrer">I'm using DVC experiments, but the Git index gets corrupted with large (4GB) files. 
What is the best workaround?</a><a href="#im-using-dvc-experiments-but-the-git-index-gets-corrupted-with-large-4gb-files-what-is-the-best-workaround" aria-label="im using dvc experiments but the git index gets corrupted with large 4gb files what is the best workaround permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Great question from @charles.melby-thompson!</p> <p>Experiment files may be tracked by Git or DVC. For large files, we generally recommend tracking them with DVC, in which case file size shouldn't be an issue.</p> <p>By default, experiments will track all other files with Git. However, Git will fail with too much data. 
If there are files you don't want to track at all (such as large temporary/intermediate files), you can add them to your .gitignore file.</p> <p>Check out <a href="https://github.com/iterative/dvc/issues/6181" target="_blank" rel="nofollow noopener noreferrer">this open issue with experiments</a> for more details and to provide feedback.</p> <h3 id="is-there-an-easy-way-to-visualize-dvc-experiment-results-without-using-the-command-line" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/930150143259459644" target="_blank" rel="nofollow noopener noreferrer">Is there an easy way to visualize DVC experiment results without using the command line?</a><a href="#is-there-an-easy-way-to-visualize-dvc-experiment-results-without-using-the-command-line" aria-label="is there an easy way to visualize dvc experiment results without using the command line permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Good question @LucZ[Mad]!</p> <p>If you bring those experiments into your regular Git workflow, e.g. 
using <a href="https://dvc.org/doc/command-reference/exp/branch"><code>dvc exp branch</code></a> to create a branch for any experiment you want to share, you could use <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a> to visualize them.</p> <p>We're working on support for viewing any pushed experiments in Studio right now so if there's anything you want to see, make sure to comment on and follow <a href="https://github.com/iterative/studio-support/issues/45" target="_blank" rel="nofollow noopener noreferrer">this issue</a>.</p> <h3 id="can-cml-self-hosted-runners-stop-the-instance-after-the-idle-timeout-instead-of-terminating" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/933674203796873226" target="_blank" rel="nofollow noopener noreferrer">Can CML self-hosted runners stop the instance after the idle timeout instead of terminating?</a><a href="#can-cml-self-hosted-runners-stop-the-instance-after-the-idle-timeout-instead-of-terminating" aria-label="can cml self hosted runners stop the instance after the idle timeout instead of terminating permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is another fantastic question from @jotsif!</p> <p>No, we deliberately terminate the instance to avoid unexpected costs. Stopped but unterminated instances <a href="https://aws.amazon.com/premiumsupport/knowledge-center/ec2-billing-terminated/" target="_blank" rel="nofollow noopener noreferrer">can still cost the same as running ones</a>. 
It's best to let the CML runner terminate and create new instances, running <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> to restore your data each time.</p> <p>However, if you're trying to preserve data (e.g. cache dependencies to speed up experimentation time) on an AWS EC2 instance, you could <a href="https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/" target="_blank" rel="nofollow noopener noreferrer">connect persistent AWS S3 remote storage</a>.</p> <h3 id="whats-the-difference-between-dvc-studio-free-and-enterprise-versions" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/933324508570472497" target="_blank" rel="nofollow noopener noreferrer">What's the difference between DVC Studio free and enterprise versions?</a><a href="#whats-the-difference-between-dvc-studio-free-and-enterprise-versions" aria-label="whats the difference between dvc studio free and enterprise versions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for asking @Abdi!</p> <p>You can find more info about the different <a href="https://studio.datachain.ai/#pricing" target="_blank" rel="nofollow noopener noreferrer">DVC Studio tiers here</a>.</p> <p>The <em>Free</em> tier has all the features most individual users need, like connecting to ML repositories, creating views, submitting experiments, and generating plots. The <em>Teams</em> tier allows you to create large teams for better collaboration and sharing of views and settings with everyone. 
The <em>Enterprise</em> tier is more for needs around compliance, dedicated support, and on-premise installation.</p> <p>If you are trying to decide which plan to select, please email us at <code>[email protected]</code> and we'll help you figure it out based on your needs.</p> <h3 id="how-can-i-use-one-dvcyaml-file-with-multiple-pipeline-folders-with-different-paramsyaml-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/939099847288578079" target="_blank" rel="nofollow noopener noreferrer">How can I use one <code>dvc.yaml</code> file with multiple pipeline folders with different <code>params.yaml</code> files?</a><a href="#how-can-i-use-one-dvcyaml-file-with-multiple-pipeline-folders-with-different-paramsyaml-files" aria-label="how can i use one dvcyaml file with multiple pipeline folders with different paramsyaml files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>@louisv, thanks for this question!</p> <p>It seems like you're looking for the parametrization functionality. 
You can learn more about how it works <a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating" target="_blank" rel="nofollow noopener noreferrer">in this doc</a>, but here's an example of what that might look like in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">cleanups</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span> <span class="token comment"># List of simple values</span>
      <span class="token punctuation">-</span> raw1
      <span class="token punctuation">-</span> labels1
      <span class="token punctuation">-</span> raw2
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> clean.py "$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>"
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> $<span class="token punctuation">{</span>item<span class="token punctuation">}</span>.cln</code></pre></div> <h3 id="is-it-possible-to-change-the-x-label-in-dvc-studio" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/938857004187943003" target="_blank" rel="nofollow noopener noreferrer">Is it possible to change the x-label in DVC Studio?</a><a href="#is-it-possible-to-change-the-x-label-in-dvc-studio" aria-label="is it possible to change the x label in dvc studio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 
0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>A great question about Studio from @PythonF!</p> <p>You can set custom properties for your plot in your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> like this:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">plots</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">plots_no_cache.csv</span><span class="token punctuation">:</span>
      <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
      <span class="token key atrule">x</span><span class="token punctuation">:</span> r</code></pre></div> <p>You can also use <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots modify</code></a> to change the x-label or y-label for your plots using commands similar to the following.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots modify</span> plots_no_cache.csv <span class="token parameter variable">-x</span> r <span class="token parameter variable">-y</span> q</span></code></pre></div> <hr> <p><img src="https://media.giphy.com/media/h5Ct5uxV5RfwY/giphy.gif" alt="Done Tyler The Creator GIF"></p> <p>At our March Office Hours Meetup we will be talking about how you can create, run, and benchmark DVC pipelines with <a href="https://github.com/zincware/ZnTrack" target="_blank" rel="nofollow noopener noreferrer">ZnTrack</a>! 
<a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/february-22-heartbeathttps://dvc.org/blog/february-22-heartbeatThu, 17 Feb 2022 00:00:00 GMT<details> <p>This month's Heartbeat image is inspired by Community member Daniel Barnes.<br> Daniel has been a great contributor to CML, helps out folks with questions in Discord, and frequently attends our Meetups. This image is inspired by his <a href="https://app.orbit.love/dvc-community/members/danielbarnes" target="_blank" rel="nofollow noopener noreferrer">GitHub profile image</a> and the fact that he used to be a competitive paraglider. His record is 9.5 hours in the air! 😳 Many thanks to Daniel for his contributions to the Community that keeps us all flying high! 🪂</p> <summary>✨Image Inspo✨</summary> </details> <h1 id="community-news" style="position:relative;">Community News<a href="#community-news" aria-label="community news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/d3mn5mnDkwECLmnK/giphy.gif" alt="Stranger Things Math GIF by Wetpaint"></p> <p>The year is already flying by! 
Check out what's new this month!</p> <h2 id="fuzzylabs-open-source-mlops-is-awesome" style="position:relative;">FuzzyLabs Open Source MLOps is Awesome<a href="#fuzzylabs-open-source-mlops-is-awesome" aria-label="fuzzylabs open source mlops is awesome permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>So let me guess, still overwhelmed with MLOps tool choices? This past month <a href="https://www.linkedin.com/in/matt-squire-a19896125/" target="_blank" rel="nofollow noopener noreferrer"><strong>Matt Squire</strong></a> of <a href="http://FuzzyLabs.ai" target="_blank" rel="nofollow noopener noreferrer">Fuzzy Labs.ai</a> reviewed their <a href="https://github.com/fuzzylabs/awesome-open-mlops">Awesome Open Source MLOps repo</a> <a href="https://fuzzylabs.ai/blog/open-source-mlops-is-awesome/" target="_blank" rel="nofollow noopener noreferrer">in this blog</a> and <a href="https://youtu.be/HIAPoKEDXrg" target="_blank" rel="nofollow noopener noreferrer">this video</a>. Matt breaks down the tool space into categories of SaaS platforms, fully open source tools, and partly open source tools. He describes how they define open source and why they think open source is the best choice in the MLOps space: it is <em>flexible</em>, <em>ownable</em>, <em>cost-effective</em>, and <em>agile</em>.</p> <blockquote> <p>"Turn key solutions quickly become inflexible." 
- Matt Squire</p> </blockquote> <p>Fuzzy Labs, a small AI company in Manchester, England, had a need for flexibility in their work with their clients, so they did a deep dive into MLOps tooling and established an MLOps Platform meeting the open source and flexible criteria they required. This stack includes our own <em>DVC</em>, as well as <a href="https://github.com/IDSIA/sacred" target="_blank" rel="nofollow noopener noreferrer">Sacred</a>, <a href="https://zenml.io/" target="_blank" rel="nofollow noopener noreferrer">ZenML</a>, <a href="https://www.seldon.io/tech/products/core" target="_blank" rel="nofollow noopener noreferrer">Seldon Core</a>, and <a href="https://evidentlyai.com/" target="_blank" rel="nofollow noopener noreferrer">Evidently AI.</a></p> <p>The blog and the video are definitely good material to review if you're choosing your ML tools.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/HIAPoKEDXrg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="continuous-machine-learning-on-huggingface-transformer-with-dvc-including-weights--biases-implementation-and-converting-weights-to-onnx" style="position:relative;">Continuous Machine Learning on Huggingface Transformer with DVC including Weights & Biases Implementation and Converting Weights to ONNX.<a href="#continuous-machine-learning-on-huggingface-transformer-with-dvc-including-weights--biases-implementation-and-converting-weights-to-onnx" aria-label="continuous 
machine learning on huggingface transformer with dvc including weights biases implementation and converting weights to onnx permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As the title would suggest, <a href="https://medium.com/@arjunkumbakkara/continuous-machine-learning-on-huggingface-transformer-with-dvc-including-weights-biases-1bc4520d210" target="_blank" rel="nofollow noopener noreferrer">this jam-packed article</a> from <a href="https://github.com/nabarunbaruaAIML" target="_blank" rel="nofollow noopener noreferrer"><strong>Nabarun Barua</strong></a> and <a href="https://github.com/arjunKumbakkara" target="_blank" rel="nofollow noopener noreferrer"><strong>Arjun Kumbakkara</strong></a> focuses on how CML can be implemented in an NLP project. They assume knowledge of DVC, Transformers, ONNX and Weights & Biases, so be ready to take your skills to the next level automating parts of the process with CML.</p> <p>They begin with the all-important setup: an AWS IAM user with EC2 & S3 Developer access, an S3 bucket to store the dataset, and a request for an EC2 spot instance. They then continue into a detailed description of all the stages of the project, outlining the use of all the tools including DVC Studio. You can find <a href="https://github.com/nabarunbaruaAIML/CML_with_DVC_on_Transformer_NLP" target="_blank" rel="nofollow noopener noreferrer">the repo for the project here.</a> Looking forward to the next installment from Nabarun and Arjun on a Dockerized Container Application cluster with Kubernetes Orchestration. 
🍿</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5e13082e5c6a0afb25bd2a81396f76df/39600/arjun-kumbakkara-architecture.png" alt="Training, Deployment and Retraining Architecture" title="Training, Deployment and Retraining Architecture" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Total architecture with the Training, Deployment, and Retraining Pipelines in the same order. (<a href="https://medium.com/@arjunkumbakkara/continuous-machine-learning-on-huggingface-transformer-with-dvc-including-weights-biases-1bc4520d210" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="dvc-used-to-help-extract-knowledge-from-covid-19-research" style="position:relative;">DVC Used to help extract knowledge from COVID-19 research<a href="#dvc-used-to-help-extract-knowledge-from-covid-19-research" aria-label="dvc used to help extract knowledge from covid 19 research permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In case you missed it in our <a href="https://twitter.com/ivanovitchm/status/1482742970461863939?s=20&t=QrfDTRHcZOKWIe5n5mb7ZQ" target="_blank" rel="nofollow noopener noreferrer">Twitter feed</a>, a group of scientists <a href="https://link.springer.com/article/10.1007/s11192-021-04260-y" target="_blank" rel="nofollow noopener noreferrer">published an article</a> in <a href="https://link.springer.com/journal/11192" target="_blank" 
rel="nofollow noopener noreferrer">Scientometrics Journal</a> entitled, <em>Discovering temporal scientometric knowledge in COVID-19 scholarly production</em>. The authors, <a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Breno_Santana-Santos" target="_blank" rel="nofollow noopener noreferrer"><strong>Breno Santana Santos</strong></a>, <a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Ivanovitch-Silva" target="_blank" rel="nofollow noopener noreferrer"><strong>Ivanovitch Silva</strong></a>, <a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Luciana-Lima" target="_blank" rel="nofollow noopener noreferrer"><strong>Luciana Lima</strong></a>, <a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Patricia_Takako-Endo" target="_blank" rel="nofollow noopener noreferrer"><strong>Patricia Takako Endo</strong></a>, <a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Gisliany-Alves" target="_blank" rel="nofollow noopener noreferrer"><strong>Gisliany Alves</strong></a>, & <a href="https://link.springer.com/article/10.1007/s11192-021-04260-y#auth-Marcel_da_C_mara-Ribeiro_Dantas" target="_blank" rel="nofollow noopener noreferrer"><strong>Marcel da Câmara Ribeiro-Dantas</strong></a>, used DVC to create a reproducible workflow that combined machine learning and Complex Network Analysis techniques to extract implicit and temporal knowledge from Scientific production bases on COVID-19.</p> <blockquote> <p>"The presented methodology has the potential to instrument and expand strategic and proactive decisions of the scientific community aiming at knowledge extraction that supports the fight against the pandemic."</p> </blockquote> <p>We are so happy to be helpful in the fight against the pandemic! 
Be sure to check out the paper and keep your eyes out for a Meetup in the future where they present this work!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/70096f540bd1c2c3c5cbf29aca5b187b/39600/scientometric.png" alt="DVC in Scientometric Covid Research" title="DVC in Scientometric Covid Research" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Discovering temporal scientometric knowledge in COVID-19 scholarly production (<a href="https://link.springer.com/article/10.1007/s11192-021-04260-y" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h1 id="github-goodness-and-integrations" style="position:relative;">GitHub Goodness and Integrations<a href="#github-goodness-and-integrations" aria-label="github goodness and integrations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <ul> <li> <p>If you're a <a href="https://guild.ai/" target="_blank" rel="nofollow noopener noreferrer"><strong>Guild.Ai</strong></a> user, you'll be happy to know that Guild now supports DVC! 
Find out more in <a href="https://my.guild.ai/t/using-guild-ai-with-dvc/803" target="_blank" rel="nofollow noopener noreferrer">this article</a> by <a href="https://www.linkedin.com/in/gar1t/" target="_blank" rel="nofollow noopener noreferrer"><strong>Garret Smith</strong></a> and the <a href="https://github.com/guildai/guildai/tree/dvc/examples/dvc" target="_blank" rel="nofollow noopener noreferrer">corresponding repo</a> for an example.</p> </li> <li> <p><a href="https://github.com/lucmos" target="_blank" rel="nofollow noopener noreferrer"><strong>Luca Moschella</strong></a> created <a href="https://github.com/grok-ai/nn-template" target="_blank" rel="nofollow noopener noreferrer">this <strong>NN template</strong></a> for your neural network projects where you want to combine PyTorch Lightning, Hydra, DVC, Weights and Biases and Streamlit.</p> </li> <li> <p>Just a reminder for your NLP projects, <a href="https://spacy.io/" target="_blank" rel="nofollow noopener noreferrer"><strong>SpaCy</strong></a> integrates with DVC as well. 
You can find out more info on <a href="https://spacy.io/usage/projects#integrations" target="_blank" rel="nofollow noopener noreferrer">the integration here.</a></p> </li> </ul> <p><img src="https://media.giphy.com/media/13zeE9qQNC5IKk/giphy.gif" alt="Seal Of Approval Thumbs Up GIF"></p> <h1 id="in-other-data-science-and-ai-news" style="position:relative;">In Other Data Science and AI News<a href="#in-other-data-science-and-ai-news" aria-label="in other data science and ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="10-most-important-jobs-for-ml-products-in-2022" style="position:relative;">10 Most Important Jobs for ML Products in 2022<a href="#10-most-important-jobs-for-ml-products-in-2022" aria-label="10 most important jobs for ml products in 2022 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f9c2c020df1b6701551c734955fb0837/39600/roles-in-ai.png" alt="10 Most Important Jobs for ML Products in 2022" title="10 Most Important Jobs for ML 
Products in 2022" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> People new to the data science/ML space are often overwhelmed by all there is to learn and by determining the path to get there. When I get this question from Community members, I always have the same advice: try to figure out what part of DS/AI is most interesting to you and then work on building your skills toward that. In this article on the <a href="https://medium.datadriveninvestor.com/the-10-most-important-jobs-for-ml-products-in-2022-7bf844d62423" target="_blank" rel="nofollow noopener noreferrer">10 Most Important Jobs for ML Products in 2022</a>, <a href="https://www.linkedin.com/in/agoston-torok/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ágoston Török</strong></a> does a great job of defining the different roles in the space, how they interrelate, and how they show up in AI companies in the product development process. See his breakdown of the roles above, with rows defining the stage, and columns, the aspects the roles focus on. 
If you find you are drawn to the space where the DS prototypes become the software product, then you may want to check out <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">our new course!</a> 😉</p> <h2 id="engineering-best-practices-for-machine-learning" style="position:relative;">Engineering Best Practices for Machine Learning<a href="#engineering-best-practices-for-machine-learning" aria-label="engineering best practices for machine learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Diving deeper into these roles, the team was abuzz recently, reviewing <a href="https://se.ewi.tudelft.nl/remla/slides/07_ASerban_mleng_practices.pdf" target="_blank" rel="nofollow noopener noreferrer">this slide deck</a> on <em>Engineering Best Practices for Machine Learning</em> by <a href="https://www.linkedin.com/in/serbanac/" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Serban</strong></a>. In it, Alex discusses the challenges of creating software from machine learning projects, the differences between these projects and traditional software development, and the need for developing robust and ethical practices. 
He and his colleagues, <a href="https://liacs.leidenuniv.nl/~blomkvander/" target="_blank" rel="nofollow noopener noreferrer"><strong>Koen van der Blom</strong></a>, <a href="https://ada.liacs.nl/members/" target="_blank" rel="nofollow noopener noreferrer"><strong>Holger Hoos</strong></a>, and <a href="https://jstvssr.github.io/" target="_blank" rel="nofollow noopener noreferrer"><strong>Joost Visser</strong></a> created a survey to determine current adoption of best practices in the industry. Along with the great review of the survey results in the slides, a number of resources were provided including <a href="https://github.com/SE-ML/awesome-seml/blob/master/readme.md" target="_blank" rel="nofollow noopener noreferrer">the corresponding Awesome list, </a> a <a href="https://se-ml.github.io/practices/" target="_blank" rel="nofollow noopener noreferrer">Catalog of Best ML Engineering Practices</a>, and their <a href="https://se-ml.github.io/" target="_blank" rel="nofollow noopener noreferrer">project website</a> for more information on the whole project. Definitely worth your review! 
✅</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d3571466bc128ab349dca2ab39d07161/39600/alex-serban.png" alt="Engineering Best Practices for Machine Learning" title="Engineering Best Practices for Machine Learning" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>29 Machine Learning Engineering practices ranked by adoption (<a href="https://se.ewi.tudelft.nl/remla/slides/07_ASerban_mleng_practices.pdf" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="twine-ethical-datasets" style="position:relative;">Twine Ethical Datasets<a href="#twine-ethical-datasets" aria-label="twine ethical datasets permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Are you in need of ethically sourced audio or video data for your ML project? 
<a href="https://www.twine.net/ai" target="_blank" rel="nofollow noopener noreferrer">Twine</a> has created a way to accomplish this, while simultaneously freeing ML teams of the project management lift associated with the collection of these datasets.<br> You can learn more about Twine's efforts in ethical data collection through these articles, <a href="https://www.twine.net/blog/the-importance-of-ethically-sourced-data/" target="_blank" rel="nofollow noopener noreferrer">The Importance of Ethically Sourced Data,</a> <a href="https://www.twine.net/blog/bias-in-data-collection/" target="_blank" rel="nofollow noopener noreferrer">Bias in Data Collection, </a> <a href="https://www.twine.net/blog/diversity-data-inclusive-workforce/" target="_blank" rel="nofollow noopener noreferrer">Collecting Diversity Data: How to Ensure an Inclusive Workforce,</a> and <a href="https://www.twine.net/blog/the-hidden-costs-of-bad-data/" target="_blank" rel="nofollow noopener noreferrer">The Hidden Costs of Bad Data.</a> Twine also provides <a href="https://www.twine.net/blog/100-audio-and-video-datasets/" target="_blank" rel="nofollow noopener noreferrer">100 open audio and video datasets</a> for anyone working on these types of projects. Check it out! 
👇🏽</p> <p> </p><section class="elp-content-holder"> <a href="https://www.twine.net/blog/100-audio-and-video-datasets/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Twine Ethically Sourced Datasets</h4> <div class="elp-description">100 Ethically sourced audio and video datasets from Twine.</div> <div class="elp-link">https://twine.net/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-02-17/twine-b098886b6287a5d276c534bb8c2de293.png" alt="Twine Ethically Sourced Datasets"> </div> </a> </section> <p></p> <h2 id="batterydev-hackathon-2022" style="position:relative;">BatteryDEV Hackathon 2022<a href="#batterydev-hackathon-2022" aria-label="batterydev hackathon 2022 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Are you interested in battery technology and in participating in a Hackathon using battery data? The <a href="https://www.tfir.io/how-experiment-versioning-is-going-to-solve-big-problems-of-ai-ml-world/" target="_blank" rel="nofollow noopener noreferrer">growth of battery technology</a> is accelerating as the world looks to solve some of its emissions issues with electric vehicles. Additionally, the demand for electric vehicles <a href="https://www.mckinsey.com/business-functions/operations/our-insights/unlocking-growth-in-battery-cell-manufacturing-for-electric-vehicles" target="_blank" rel="nofollow noopener noreferrer">is outpacing</a> the manufacturers' ability to supply the needed batteries. 
Datasets in the space are kept proprietary as companies work independently to develop patents. BatteryDEV 2022 aims to accelerate battery innovation through open source competitions. This year they are expecting 300 participants for the event from March 20-26. Community member <a href="https://www.linkedin.com/in/raymond-james-gasper/" target="_blank" rel="nofollow noopener noreferrer">Raymond Gasper</a> is one of the organizers of <a href="https://battery.dev" target="_blank" rel="nofollow noopener noreferrer">Battery.dev</a>, and is creating a DVC template for participants to use during the Hackathon. You can <a href="https://www.battery.dev/registration-form" target="_blank" rel="nofollow noopener noreferrer">register for the event here!</a></p> <p> </p><section class="elp-content-holder"> <a href="https://battery.dev" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">BatteryDEV 2022 Hackathon</h4> <div class="elp-description">A global innovation challenge for battery, data and machine learning enthusiasts.</div> <div class="elp-link">https://battery.dev/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-02-17/battery-dev-f0777718a6d186b28446066b3f901cc4.png" alt="BatteryDEV 2022 Hackathon"> </div> </a> </section> <p></p> <h1 id="company-news" style="position:relative;">Company News<a href="#company-news" aria-label="company news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><a href="https://twitter.com/FullStackML" target="_blank" 
rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a> talked to <a href="https://twitter.com/SwapBhartiya" target="_blank" rel="nofollow noopener noreferrer"><strong>Swapnil Bhartiya</strong></a> recently about how experiment versioning can help to solve the big problems of the AI/ML world. In this interview you will learn how experiment versioning tracks everything you need for a particular experiment so that the result is reproducible from prototyping to production. This solution enables data science and engineering teams to work more productively together.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/y5zp54LiAqg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="march-office-hours" style="position:relative;">March Office Hours!<a href="#march-office-hours" aria-label="march office hours permalink" class="anchor after"><svg 
aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Be sure to join us at the <a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" target="_blank" rel="nofollow noopener noreferrer">March Office Hours Meetup,</a> where <a href="https://github.com/PythonFZ/" target="_blank" rel="nofollow noopener noreferrer"><strong>Fabian Zills</strong>,</a> PhD student at <a href="https://www.uni-stuttgart.de/en/" target="_blank" rel="nofollow noopener noreferrer">University of Stuttgart,</a> will present his ZnTrack project which creates, runs and benchmarks DVC pipelines in Python and Jupyter Notebooks.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/Machine-Learning-Engineer-Community-Virtual-Meetups/events/283998696/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">March Office Hours - ZnTrack</h4> <div class="elp-description">RSVP for DVC Office Hours - ZnTrack - Create, Visualize, Run and Benchmark DVC Pipelines in Python & Jupyter Notebooks </div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-02-17/office-hours-meetup-39d6c71b2928c57d1858c4544400dffc.png" alt="March Office Hours - ZnTrack"> </div> </a> </section> <p></p> <h2 id="new-hires" style="position:relative;">New Hires<a href="#new-hires" aria-label="new hires permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 
3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We are extremely excited to welcome our new Director of Engineering, <a href="https://www.linkedin.com/in/odedmesser/" target="_blank" rel="nofollow noopener noreferrer"><strong>Oded Messer</strong></a>. Oded lives in Israel and plans to pour his time and attention into the people/processes/structures of the engineering org to facilitate healthy growth and culture.💗 He brings hands-on and managerial industry experience in the backend/tooling/infra and MLOps domains (ex. Intel and Iguazio). In his spare time Oded remembers traveling being a favorite activity, and also admits to being a sci-fi geek. He's in good company here! 😉</p> <p>We welcome <a href="https://twitter.com/alex000kim" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Kim</strong></a> who joins us as a Field Data Scientist from Montreal, Canada. Alex's previous professional experience has been at the intersection of Software Engineering and Data Science across a few different industries. He has also done consulting work to develop Data Science curriculums for EdTech companies. Alex speaks Russian and a little French in addition to English. In his free time, Alex likes to bake, his specialty being pizza! 🍕</p> <details> <p>We now have three Alex's on the team to match our three Davids!</p> <summary>🎉Fun Fact!</summary> </details> <p><a href="https://github.com/jesper7" target="_blank" rel="nofollow noopener noreferrer"><strong>Jesper Svendsen</strong></a> joins the team as a Platform Engineer from Denmark.<br> Previously, Jesper worked as an SRE for Evaxion Biotech (another ML-driven company). 
Prior to that, he was a self-employed IT consultant doing full-stack development. Jesper's hobbies include reading books (particularly on medicine and psychology), weightlifting, running, and photography. 📸</p> <details> <p>Jesper is the eighth employee to join <a href="https://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">Iterative.AI</a> with a name starting with the letter 'J.' I thought this was odd, as words that start with 'j' have one of the <a href="https://funbutlearn.com/2012/06/which-english-letter-has-maximum-words.html" target="_blank" rel="nofollow noopener noreferrer">lowest frequencies in the English language</a>. But as it turns out, 'J' is <a href="https://www.quora.com/What-letter-of-the-English-alphabet-are-used-most-as-the-first-letter-of-the-first-name" target="_blank" rel="nofollow noopener noreferrer">one of the more common first initials.</a></p> <summary>🎉Fun Fact!</summary> </details> <p><a href="https://github.com/erudin" target="_blank" rel="nofollow noopener noreferrer"><strong>Gabriella Caraballo</strong></a> joins Iterative as a Backend Engineer. She is originally from Venezuela, but is currently living in Canada! Programming was a hobby that became a professional path for Gabriella. She loves everything related to security, privacy and open source. In her free time, Gabriella enjoys cooking and eating, playing video/board games, crocheting, photography, and music. Now that she's in Canada, she has added skiing to her hobbies! 
⛷</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Even with these amazing new additions to the team, we're still hiring! <a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions and share with anyone you think may be interested! 🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative is Hiring (<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 
1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">With tools like <a href="https://twitter.com/DVCorg">@DVCorg</a> & <a href="https://twitter.com/TheRealDagsHub">@TheRealDAGsHub</a> you can easily share , review & reproduce/reuse your work. <br><br>Just like how git makes software development smooth for software developers that's how tools like DVC make reproducibility smooth for ML Engineers.</p>— Gift Ojeabulu (@GiftOjeabulu_) <a href="https://twitter.com/GiftOjeabulu_/status/1490771330949599234">February 7, 2022</a></blockquote> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/january-22-community-gemshttps://dvc.org/blog/january-22-community-gemsMon, 31 Jan 2022 00:00:00 GMT<h3 id="is-it-possible-to-stream-objects-to-and-from-remote-caches" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/919567459189682177" target="_blank" rel="nofollow noopener noreferrer">Is it possible to stream objects to and from remote caches?</a><a href="#is-it-possible-to-stream-objects-to-and-from-remote-caches" aria-label="is it possible to stream objects to and from remote caches permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for asking @mihaj!</p> <p>You can stream files using the <a href="https://dvc.org/doc/api-reference" target="_blank" rel="nofollow noopener noreferrer">DVC API</a>. There are two methods that you'll likely want to check out. First there's <a href="https://dvc.org/doc/api-reference/open"><code>dvc.api.open()</code></a>. This opens a file tracked by DVC and generates a corresponding file object. 
Here's a quick example:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api

<span class="token keyword">with</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span>
    <span class="token string">'get-started/data.xml'</span><span class="token punctuation">,</span>
    repo<span class="token operator">=</span><span class="token string">'https://github.com/iterative/dataset-registry'</span>
<span class="token punctuation">)</span> <span class="token keyword">as</span> fd<span class="token punctuation">:</span>
    <span class="token comment"># do things with the file object here</span>
    contents <span class="token operator">=</span> fd<span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> <p>The simplest way to return the contents of a DVC-tracked file is to use <a href="https://dvc.org/doc/api-reference/read"><code>dvc.api.read()</code></a>. The returned content can be a string or a bytes object, depending on the mode. 
Here's a little example of this being used:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> pickle
<span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api

model <span class="token operator">=</span> pickle<span class="token punctuation">.</span>loads<span class="token punctuation">(</span>
    dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>read<span class="token punctuation">(</span>
        <span class="token string">'model.pkl'</span><span class="token punctuation">,</span>
        repo<span class="token operator">=</span><span class="token string">'https://github.com/iterative/example-get-started'</span><span class="token punctuation">,</span>
        mode<span class="token operator">=</span><span class="token string">'rb'</span>
    <span class="token punctuation">)</span>
<span class="token punctuation">)</span></code></pre></div> <h3 id="one-of-the-steps-in-my-dvc-pipeline-uses-a-pip-installed-package-what-is-the-best-way-to-make-sure-that-dvc-re-runs-the-steps-that-depend-on-that-package" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/920139825284280381" target="_blank" rel="nofollow noopener noreferrer">One of the steps in my DVC pipeline uses a <code>pip</code> installed package. 
What is the best way to make sure that DVC re-runs the steps that depend on that package?</a><a href="#one-of-the-steps-in-my-dvc-pipeline-uses-a-pip-installed-package-what-is-the-best-way-to-make-sure-that-dvc-re-runs-the-steps-that-depend-on-that-package" aria-label="one of the steps in my dvc pipeline uses a pip installed package what is the best way to make sure that dvc re runs the steps that depend on that package permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the question @alphaomega!</p> <p>The best way to handle any package dependencies is to include a <code>requirements.txt</code> file with the specific versions your pipeline needs.</p> <p>Another approach you can take is having a stage that dumps the package version as an intermediate output. It doesn't have to be saved in Git or DVC because it's easily reproduced and DVC should be able to take care of detecting that the package didn't change. 
Here's an example of a stage that does this.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">package_version</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> pip freeze <span class="token punctuation">|</span> grep "package_name==" <span class="token punctuation">></span> package_name_version.txt
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> package_name_version.txt
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> package_name_version.txt</code></pre></div> <h3 id="does-dvc-save-dependencies-which-are-in-the-dvcyaml-pipeline-to-the-cache" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/920659549835370497" target="_blank" rel="nofollow noopener noreferrer">Does DVC save dependencies which are in the <code>dvc.yaml</code> pipeline to the cache?</a><a href="#does-dvc-save-dependencies-which-are-in-the-dvcyaml-pipeline-to-the-cache" aria-label="does dvc save dependencies which are in the dvcyaml pipeline to the cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 
3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for another great question @rie!</p> <p>DVC doesn't track the pipeline dependencies in the cache or storage, only the outputs. If you want DVC to track a pure data dependency that's not an output of a different stage, you need to track it with <a href="https://dvc.org/doc/command-reference/add"><code>dvc add ...</code></a></p> <p>The output of a pipeline might be something like <code>data.dvc</code>, while a pure dependency might be a file that's just a part of the project, like <code>script.py</code>. That's why you'll need to use the <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> command to track this.</p> <h3 id="what-is-the-difference-between-kubeflow-pipelines-and-dvc-pipelines" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/922728960478035978" target="_blank" rel="nofollow noopener noreferrer">What is the difference between Kubeflow pipelines and DVC pipelines?</a><a href="#what-is-the-difference-between-kubeflow-pipelines-and-dvc-pipelines" aria-label="what is the difference between kubeflow pipelines and dvc pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a fantastic question! 
Thanks for asking @ramakrishnamamidi!</p> <p>A major difference is that DVC focuses primarily on ML <em>development</em> and adding lightweight functionality on top of existing projects, which may be reusable in deployment in some cases.</p> <p>Kubeflow focuses on <em>deployment</em> and building on top of Kubernetes, which could be used during development but requires more up-front effort.</p> <h3 id="could-dvc-be-a-good-alternative-to-lfs-for-game-development" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485586884165107734/928336349487067196" target="_blank" rel="nofollow noopener noreferrer">Could DVC be a good alternative to LFS for game development?</a><a href="#could-dvc-be-a-good-alternative-to-lfs-for-game-development" aria-label="could dvc be a good alternative to lfs for game development permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for such an interesting question @CB!</p> <p>Yes! We have community members that use DVC to handle their large files in game development.</p> <p>There are several other use cases we've seen for DVC outside of machine learning and data science. 
Some people have used DVC to track build artifacts for deployment systems and to track performance data alongside design iterations and simulation tools.</p> <p>You should check out our <a href="https://discord.com/channels/485586884165107732/918159153824952320" target="_blank" rel="nofollow noopener noreferrer">#beyond-ml</a> Discord channel to stay up to date with the other use cases the community is coming up with!</p> <h3 id="does-dvc-run-on-jsonyaml-configuration-files-for-all-things" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/928779586622332938" target="_blank" rel="nofollow noopener noreferrer">Does DVC run on JSON/YAML configuration files for all things?</a><a href="#does-dvc-run-on-jsonyaml-configuration-files-for-all-things" aria-label="does dvc run on jsonyaml configuration files for all things permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a great question about large projects with a lot of dependencies from @SolemnSimulacrum!</p> <p>All of the dependencies you list in <code>dvc run</code> are in fact configured in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. <code>dvc run</code> is a convenience for adding a pipeline stage to this file and then running <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> on that stage. 
It's completely acceptable and even encouraged to directly edit <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> if that's easier.</p> <p>For example, if you are currently executing a command like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> prune <span class="token punctuation">\</span>
          <span class="token parameter variable">-o</span> model.pt <span class="token punctuation">\</span>
          <span class="token parameter variable">-d</span> ./DepFiles_0/ <span class="token punctuation">\</span>
          <span class="token parameter variable">-d</span> ./DepFiles_1/ <span class="token punctuation">\</span>
          <span class="token parameter variable">-d</span> ./DepFiles_2/ <span class="token punctuation">\</span>
          <span class="token parameter variable">-d</span> ./src/.py <span class="token punctuation">\</span>
          <span class="token parameter variable">-d</span> ./packages/.py <span class="token punctuation">\</span>
          <span class="token parameter variable">-d</span> ./scripts/.py <span class="token punctuation">\</span>
          <span class="token parameter variable">-d</span> ./data/.npy <span class="token punctuation">\</span>
          python script.py</span></code></pre></div> <p>You could add those directly to the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> like this:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">prune</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python script.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> ./DepFiles_0/
      <span class="token punctuation">-</span> ./DepFiles_1/
      <span class="token punctuation">-</span> ./DepFiles_2/
      <span class="token punctuation">-</span> ./src/.py
      <span class="token punctuation">-</span> ./packages/.py
      <span class="token punctuation">-</span> ./scripts/.py
      <span class="token punctuation">-</span> ./data/.npy
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> model.pt</code></pre></div> <h3 id="im-setting-up-mlops-at-my-company-from-scratch-and-we-use-gitlab-and-cloudera-ds-workbench-what-are-the-best-resources-to-get-started-with-cml" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/923785806848614461" target="_blank" rel="nofollow noopener noreferrer">I'm setting up MLOps at my company from scratch and we use GitLab and Cloudera DS workbench. What are the best resources to get started with CML?</a><a href="#im-setting-up-mlops-at-my-company-from-scratch-and-we-use-gitlab-and-cloudera-ds-workbench-what-are-the-best-resources-to-get-started-with-cml" aria-label="im setting up mlops at my company from scratch and we use gitlab and cloudera ds workbench what are the best resources to get started with cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a great question from @dvc!</p> <p>We recommend you start with the <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML docs website</a>.</p> 
<p>You can find some tutorials on <a href="https://dvc.org/blog" target="_blank" rel="nofollow noopener noreferrer">our blog</a>.</p> <p>Or you can check out the videos on <a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">our YouTube channel</a>.</p> <p>And of course, you can always ask questions in the Discord community!</p> <h3 id="i-understand-that-dvc-studio-is-a-discoverability-layer-over-my-dvc-repo-in-github-will-any-of-my-data-be-stored-on-your-servers" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/923714473603256420" target="_blank" rel="nofollow noopener noreferrer">I understand that DVC Studio is a discoverability layer over my DVC repo in GitHub. Will any of my data be stored on your servers?</a><a href="#i-understand-that-dvc-studio-is-a-discoverability-layer-over-my-dvc-repo-in-github-will-any-of-my-data-be-stored-on-your-servers" aria-label="i understand that dvc studio is a discoverability layer over my dvc repo in github will any of my data be stored on your servers permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a great question about DVC Studio from @johnnyaug!</p> <p>DVC Studio only stores metrics, plots, and metadata about your pipelines in its databases so that it can serve them as a table. 
We don't read actual data and we don't store code.</p> <p>An important thing to note is that if you have plots from <a href="https://dvc.org/doc/command-reference/plots/show"><code>dvc plots show</code></a> that are images, JSON files, or Vega specs, those could be saved on our end as well so we can serve them to the UI.</p> <p>We're working on documentation for this as well!</p> <hr> <p><img src="https://media.giphy.com/media/zCME2Cd20Czvy/giphy.gif" alt="The Lord Of The Rings GIF"></p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/January-22-heartbeathttps://dvc.org/blog/January-22-heartbeatTue, 18 Jan 2022 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Happy New Year! 
Hope you got some good rest and stayed healthy at the end of 2021, because 2022 has lots of great things in store!</p> <p><img src="https://media.giphy.com/media/7ILAGpJWoQYWA0j60C/giphy.gif" alt="Heartbeat!"></p> <h2 id="diego-jardim---mlops-a-complete-hands-on-introduction" style="position:relative;">Diego Jardim - MLOps: A Complete Hands-On Introduction<a href="#diego-jardim---mlops-a-complete-hands-on-introduction" aria-label="diego jardim mlops a complete hands on introduction permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://poatek.com/2021/12/20/mlops-a-complete-and-hands-on-introduction-part-1/" target="_blank" rel="nofollow noopener noreferrer">In Part 1</a> of his two-part series, <a href="https://www.linkedin.com/in/diegosevero/" target="_blank" rel="nofollow noopener noreferrer"><strong>Diego Jardim</strong></a> of <a href="https://poatek.com/" target="_blank" rel="nofollow noopener noreferrer">Poatek</a> takes us through the basics of MLOps and the stages of implementation and maturity of an MLOps pipeline. He closes by introducing us to some tools to help a team progress through these stages, which include DVC and CML.</p> <p><a href="https://poatek.com/2021/12/29/mlops-a-complete-and-hands-on-introduction-part-2/" target="_blank" rel="nofollow noopener noreferrer">In Part 2</a> he delves into more detail and code on how to set up version control of everything with DVC as well as automation of experimentation and reporting with CML. Finally, he uses FastAPI and Heroku for model serving and deployment. 
You can find all the scripts for the project in <a href="https://github.com/dsjardim/fraud-detection-mlops" target="_blank" rel="nofollow noopener noreferrer">this GitHub repository.</a></p> <p> </p><section class="elp-content-holder"> <a href="https://poatek.com/2021/12/29/mlops-a-complete-and-hands-on-introduction-part-2/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">MLOps: A Complete Hands-On Tutorial</h4> <div class="elp-description">In his 2-part series, Diego Jardim of Poatek introduces concepts and stages of MLOps and provides a tutorial on how to create an MLOps pipeline.</div> <div class="elp-link">https://poatek.com/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-01-18/diego-jardim-11d746cd4b5f1f2f63503c8ce444a180.png" alt="MLOps: A Complete Hands-On Tutorial"> </div> </a> </section> <p></p> <h2 id="carl-w-handlin-wallace---reproducible-data-science-and-why-it-matters" style="position:relative;">Carl W. Handlin Wallace - Reproducible Data Science and Why it Matters<a href="#carl-w-handlin-wallace---reproducible-data-science-and-why-it-matters" aria-label="carl w handlin wallace reproducible data science and why it matters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/carlhandlin/" target="_blank" rel="nofollow noopener noreferrer"><strong>Carl W. 
Handlin Wallace</strong></a> of <a href="https://www.rappibank.pe/" target="_blank" rel="nofollow noopener noreferrer">RappiBank</a> wrote a <a href="https://medium.com/rappibank/reproducible-data-science-and-why-it-matters-e4e62fd60b9a/" target="_blank" rel="nofollow noopener noreferrer">great article</a> for their company <a href="https://medium.com/" target="_blank" rel="nofollow noopener noreferrer">Medium</a> profile on the importance of reproducibility, AKA replicability, in science in general, and the challenges in Data Science in particular. As he points out, from <a href="https://doi.org/10.1038/533452a" target="_blank" rel="nofollow noopener noreferrer">Nature's survey,</a> over half of all researchers have failed to reproduce even their own work, let alone that of another scientist. While initiatives like <a href="https://paperswithcode.com/" target="_blank" rel="nofollow noopener noreferrer">Papers With Code</a> are helping to encourage reproducibility in the industry, there's still work to be done. He notes DVC as a part of the solution to this problem along with other tools to round out the whole picture. Check out the article for good food for thought and other resources!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c52351436c2bfd17633dbb36d3dfd200/39600/carl-handlin-rappibank.png" alt="Proposed Reproducibility Framework for Data Science" title="Proposed Reproducibility Framework for Data Science" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p><em>Carl W. 
Handlin Wallace's Proposed Reproducibility Framework for Data Science (<a href="https://medium.com/rappibank/reproducible-data-science-and-why-it-matters-e4e62fd60b9a/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="abid-ali-awan---tips--tricks-of-deploying-deep-learning-webapp-on-heroku-cloud" style="position:relative;">Abid Ali Awan - Tips & Tricks of Deploying Deep Learning Webapp on Heroku Cloud<a href="#abid-ali-awan---tips--tricks-of-deploying-deep-learning-webapp-on-heroku-cloud" aria-label="abid ali awan tips tricks of deploying deep learning webapp on heroku cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: relative; display: block; ; ; max-width: 450px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2a68cbe567a7492f57b697eb6bbf9273/39600/abid-ali-awan.png" alt="DVC Heroku Integration" title="Heroku Hidden Tricks =" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p><a href="https://www.linkedin.com/in/1abidaliawan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Abid Ali Awan</strong>'s</a> <a href="https://www.kdnuggets.com/2021/12/tips-tricks-deploying-dl-webapps-heroku.html" target="_blank" rel="nofollow noopener noreferrer">article in KDNuggets</a><br> guides you on how to create a smooth process to deploy a deep learning web application with Heroku. 
In the guide, he covers integration with DVC and optimizing storage using Docker, Git & CLI-based deployment, how to deal with error code H10, and tweaking Python packages to stay within the 500 MB Heroku limit. If you've been looking for a way to create a deep learning web app, this may help!</p> <h2 id="amit-kulkarni---overview-of-mlops-with-open-source-tools" style="position:relative;">Amit Kulkarni - Overview of MLOps with Open Source Tools<a href="#amit-kulkarni---overview-of-mlops-with-open-source-tools" aria-label="amit kulkarni overview of mlops with open source tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In the very <a href="https://www.analyticsvidhya.com/blog/2022/01/overview-of-mlops-with-open-source-tools/" target="_blank" rel="nofollow noopener noreferrer"><strong>FIRST</strong> tutorial of DVC Studio</a> from the Community, <a href="http://www.linkedin.com/in/amitvkulkarni2" target="_blank" rel="nofollow noopener noreferrer"><strong>Amit Kulkarni</strong></a> reviews the setup process of DVC Studio and MLFlow and how they ease operations for machine learning teams by giving them a clear way to track all the factors that go into the iterative process. 
Amit covers the easy setup process, adding a view, model comparison, and running experiments from the DVC Studio UI.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/a4a6d7e537e16595bcd3e6f92afae851/39600/amit-kulkarni-studio.png" alt="DVC Studio Experiment Tracker UI" title="DVC Studio Experiment Tracker UI" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Amit Kulkarni's DVC Studio tutorial (<a href="https://www.analyticsvidhya.com/blog/2022/01/overview-of-mlops-with-open-source-tools/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h1 id="github-goodness-" style="position:relative;">GitHub Goodness 😎<a href="#github-goodness-" aria-label="github goodness permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/3ohzdIuqJoo8QdKlnW/giphy.gif" alt="Will Ferrell Reaction GIF"></p> <p>In case you missed it we now have an <a href="https://github.com/iterative/awesome-iterative-projects" target="_blank" rel="nofollow noopener noreferrer">Awesome Iterative Projects Repository.</a> This repository is a list of projects relying on Iterative tools to achieve awesomeness. 
Recent additions to the list include:</p> <ul> <li><a href="https://github.com/zincware/ZnTrack" target="_blank" rel="nofollow noopener noreferrer">zincware/ZnTrack</a>: Create, visualize, run & benchmark DVC pipelines in Python & Jupyter notebooks.</li> <li><a href="https://github.com/gennaro-tedesco/nvim-dvc" target="_blank" rel="nofollow noopener noreferrer">nvim-dvc</a>: Neovim plugin for DVC.</li> </ul> <p>We'd love to see more of the Community's awesome work added to this list. Feel free to submit your project!</p> <p>Other repos that came across my radar this last month that may be of interest to our Community:</p> <ul> <li><a href="https://github.com/Nachimak28/awesome-list-of-awesomes" target="_blank" rel="nofollow noopener noreferrer">An Awesome List of Awesomes</a>: an aggregation of all the Awesome lists</li> <li><a href="https://github.com/visenger/awesome-mlops" target="_blank" rel="nofollow noopener noreferrer">Awesome MLOps</a>: an awesome list of references for MLOps.</li> <li><a href="https://github.com/mateuspicanco/project-atlas-sao-paulo" target="_blank" rel="nofollow noopener noreferrer">Project Atlas - São Paulo</a> : a Data Science and Engineering initiative that aims to develop relevant and curated Geospatial features of São Paulo, Brazil (includes DVC).</li> <li><a href="https://github.com/lucmos/nn-template" target="_blank" rel="nofollow noopener noreferrer">NN Template</a>: Generic template to bootstrap your PyTorch project (includes DVC)</li> </ul> <h1 id="deciding-on-mlops-tools" style="position:relative;">Deciding on MLOps tools?<a href="#deciding-on-mlops-tools" aria-label="deciding on mlops tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 
0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/3ohjUZZEFfWJfaeKUE/giphy.gif" alt="Think Season 2 GIF by Portlandia"></p> <p><a href="https://media.giphy.com/media/3ohjUZZEFfWJfaeKUE/giphy.gif" target="_blank" rel="nofollow noopener noreferrer">Last month</a> I told you about Thoughtworks' guide to MLOps Platforms. If you prefer video content, you may like <a href="https://www.thoughtworks.com/what-we-do/data-and-ai/cd4ml/guide-to-evaluating-mlops-platforms1?utm_source=linkedin&utm_medium=social-organic&utm_campaign=tw-webinars_2021-12&gh_src=463a2f181us" target="_blank" rel="nofollow noopener noreferrer">this webinar</a> from <a href="https://www.linkedin.com/in/ryan-dawson-501ab9123/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ryan Dawson</strong></a> on CD4ML covering the process of identifying the best tools for your team's needs.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/69df8067190aee55adfb3b41c1bc2d0e/39600/ryan-dawson-thoughtworks-cd4ml.png" alt="MLOPs Tool evaluation process" title="MLOPs Tool evaluation process" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ryan Dawson's MLOps tool evaluation process (<a href="https://www.thoughtworks.com/what-we-do/data-and-ai/cd4ml/guide-to-evaluating-mlops-platforms1?utm_source=linkedin&utm_medium=social-organic&utm_campaign=tw-webinars_2021-12&gh_src=463a2f181us" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p><a href="https://www.linkedin.com/in/deanpleban/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dean Pleban</strong>,</a> CEO of <a href="https://dagshub.com" target="_blank" rel="nofollow noopener noreferrer">DAGsHub,</a> 
also gave a great talk on a decision making framework for deciding on your tools in his presentation at <a href="https://devopsdays.org/events/2021-tel-aviv/welcome/" target="_blank" rel="nofollow noopener noreferrer">DevOpsDays Tel Aviv</a>. In this talk you will learn guidelines and mental models that will help you choose tools in whatever stage of the process you are in.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/XLc733qO2lE?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="in-other-data-science-and-ai-news" style="position:relative;">In Other Data Science and AI News<a href="#in-other-data-science-and-ai-news" aria-label="in other data science and ai news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="rob-toews-ai-predictions-in-forbes" style="position:relative;">Rob Toews AI Predictions in Forbes<a href="#rob-toews-ai-predictions-in-forbes" aria-label="rob toews ai predictions in forbes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 
9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.twitter.com/_RobToews" target="_blank" rel="nofollow noopener noreferrer"><strong>Rob Toews</strong></a> wrote <a href="https://www.forbes.com/sites/robtoews/2021/12/22/10-ai-predictions-for-2022/?sh=559c4c8d482d" target="_blank" rel="nofollow noopener noreferrer">10 AI Predictions for 2022</a> for <a href="https://forbes.com" target="_blank" rel="nofollow noopener noreferrer">Forbes.</a> In it, he predicts more startups getting funded in NLP than in any other category, reinforcement learning becoming increasingly important, the rise of synthetic data, and powerful new AI tools being built for video. 
My favorite prediction:</p> <blockquote> <p>"'Responsible AI' will begin to shift from a vague catch-all term to an operationalized set of enterprise practices."<br> That's good news!</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://www.forbes.com/sites/robtoews/2021/12/22/10-ai-predictions-for-2022/?sh=559c4c8d482" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">10 AI Predictions for 2022</h4> <div class="elp-description">Rob Toews predicts the rise of NLP, reinforcement learning, operationalized responsible AI and more.</div> <div class="elp-link">https://forbes.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-01-18/forbes-8b4621c09667e17823a38d9a3b116086.jpeg" alt="10 AI Predictions for 2022"> </div> </a> </section> <p></p> <h3 id="chip-huyens-latest-blog-post" style="position:relative;">Chip Huyen's Latest Blog Post<a href="#chip-huyens-latest-blog-post" 
aria-label="chip huyens latest blog post permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You may remember <a href="https://twitter.com/chipro" target="_blank" rel="nofollow noopener noreferrer"><strong>Chip Huyen</strong></a> from <a href="https://huyenchip.com/2020/12/30/mlops-v2.html" target="_blank" rel="nofollow noopener noreferrer">MLOps Tooling Landscape v2</a> and <a href="https://docs.google.com/presentation/d/15ZrLFzimfy-8ob7mJ0qHPNyVoTtSfKBF5gPPG5f0Lz8/edit#slide=id.p" target="_blank" rel="nofollow noopener noreferrer">DVC's inclusion in her Machine Learning Systems Design Lecture series</a>. But at the turn of the new year, she published a new blog post entitled <a href="https://huyenchip.com/2022/01/02/real-time-machine-learning-challenges-and-solutions.html" target="_blank" rel="nofollow noopener noreferrer">Real-time machine learning: Challenges and Solutions.</a> The article describes her learning from working with approximately 30 companies in different industries doing real-time machine learning. She describes the online prediction processes of batch prediction and streaming prediction.</p> <p>Additionally she discusses continual learning and the difference between stateless retraining (the model is trained from scratch each time), and stateful training (the model continues training on new data) and moving from a manual process to a more automated one. 
Definitely worth a read, and we believe DVC and CML can help you with your stateful training!</p> <p>She and her team are running a <a href="https://forms.gle/dDvgF7QgpPdvJE5b8" target="_blank" rel="nofollow noopener noreferrer">survey</a> to better understand the adoption and challenges of real-time ML. We encourage your participation!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7ccdb3437ca25f7b5784bf6422f2a300/39600/stateful-training.png" alt="Stateful Training" title="Stateful Training" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Chip Huyen's Stateless vs. Stateful Training (<a href="https://huyenchip.com/2022/01/02/real-time-machine-learning-challenges-and-solutions.html" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="vicki-boykis-top-three-fundamental-tools-for-a-machine-learning-engineer" style="position:relative;">Vicki Boykis' Top Three Fundamental Tools for a Machine Learning Engineer<a href="#vicki-boykis-top-three-fundamental-tools-for-a-machine-learning-engineer" aria-label="vicki boykis top three fundamental tools for a machine learning engineer permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 300px; "><img class="gatsby-resp-image-image" 
src="https://dvc.org/static/2bcc8468ee8185a0b26767d0b75e6526/03346/git-sql-cli.jpg" alt="Git, SQL, CLI" title="Git, SQL, CLI =" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> If you're interested in becoming a machine learning engineer and you're not familiar with <a href="https://twitter.com/vboykis" target="_blank" rel="nofollow noopener noreferrer"><strong>Vicki Boykis</strong>,</a> you should be. She has an amazing blog with years of well-written, funny, technical content on machine learning. Her latest piece entitled <a href="https://vickiboykis.com/2022/01/09/git-sql-cli/" target="_blank" rel="nofollow noopener noreferrer">Git, SQL, CLI</a> tells why she thinks these three tools are fundamental tools for any technical job. We think so too.</p> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="our-online-course-is-live-" style="position:relative;">Our Online Course is Live! 
🎉<a href="#our-online-course-is-live-" aria-label="our online course is live permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>You can register for the FREE new course <a href="https://learn.dvc.org" target="_blank" rel="nofollow noopener noreferrer">here on the Iterative website</a>. The course is currently in beta mode. We already have some things we are working on to make it even better, but we would love your feedback! 🙏🏼 So far we have had some minor glitches and a lot of positive feedback! But we want your critiques too!</p> <p><strong>Whoever can give us feedback on any three modules by February 6th will receive some fresh new swag!</strong></p> <p>We are already planning our next course!</p> <h2 id="experiment-versioning-piece-in-kdnuggets" style="position:relative;">Experiment Versioning piece in KDNuggets<a href="#experiment-versioning-piece-in-kdnuggets" aria-label="experiment versioning piece in kdnuggets permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our Senior Developer Advocate <a href="https://twitter.com/mariaKhalusova" target="_blank" rel="nofollow noopener noreferrer"><strong>Maria Khalusova</strong></a> wrote a 
tutorial piece on <code>exp init</code> and experiment versioning entitled <a href="https://www.kdnuggets.com/2021/12/versioning-machine-learning-experiments-tracking.html" target="_blank" rel="nofollow noopener noreferrer">Versioning Machine Learning Experiments vs Tracking Them.</a> The command helps you quickly set up a pipeline and codify your experiments with all of the factors that contributed to each of them, including data, code, pipeline, model version and all hyperparameters. This is a step above other experiment tracking tools and enables you to achieve true reproducibility.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.kdnuggets.com/2021/12/versioning-machine-learning-experiments-tracking.html" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Versioning Machine Learning Experiments vs Tracking Them</h4> <div class="elp-description">Maria Khalusova's tutorial on DVC's `exp init` command and the next level of experiment tracking that delivers true reproducibility.</div> <div class="elp-link">https://kdnuggets.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2022-01-18/kdnuggets-1a388aac267d8ec89f41ff66516c76bc.jpeg" alt="Versioning Machine Learning Experiments vs Tracking Them"> </div> </a> </section> <p></p> <h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have a few new team members 
this month!</p> <p><a href="https://github.com/dtrifiro" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniele Trifirò</strong></a> is our first team member from Italy! He joins us as a Senior Software Engineer. Daniele has a background in Physics/Astrophysics and worked for 4 years as a researcher in the LIGO Scientific Collaboration before moving on to positions at Cloudian and illimity. It was at illimity where he "fell in love" with DVC! In his free time Daniele likes listening to and sometimes playing music himself, as well as rock climbing. 🧗🏼‍♂️</p> <p><a href="https://github.com/yathomasi" target="_blank" rel="nofollow noopener noreferrer"><strong>Thomas Kunwar</strong></a> is a software engineer joining the team from Nepal. He's been working as a full-stack developer specializing in the MERN stack and has led a team on multiple projects. In his free time Thomas enjoys trekking, watching and playing sports, watching movies, and learning. Welcome Thomas! 👏🏼</p> <p><a href="https://github.com/madhur-tandon" target="_blank" rel="nofollow noopener noreferrer"><strong>Madhur Tandon</strong></a> joins our team as a Software Engineer from Delhi, India. He is active in open source, and his best-known contributions are to projects such as Pyodide (the Python Scientific Stack compiled to WebAssembly) and Jupyterlite (a Jupyter distribution running in the browser). He has also been a speaker at PyData and JupyterCon. Talk to him about his solo trip to SF, his experiences at Mozilla or about books, Indian governance, food, and crypto. 
When not working, he is working out!💪🏼</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Even with these amazing new additions to the team, we're still hiring! <a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions and share with anyone you think may be interested! 🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b575ad2dddeaef4c8a1e475f80cc5ca2/03346/hiring.jpg" alt="Iterative.ai is Hiring" title="Iterative.ai is Hiring" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative is Hiring (<a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="upcoming-events" style="position:relative;">Upcoming Events<a href="#upcoming-events" aria-label="upcoming events permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 
12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="january-office-hours" style="position:relative;">January Office Hours!<a href="#january-office-hours" aria-label="january office hours permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Be sure to join us at the <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" target="_blank" rel="nofollow noopener noreferrer">January Office Hours Meetup,</a> where <a href="https://www.linkedin.com/in/gennarotedesco/" target="_blank" rel="nofollow noopener noreferrer"><strong>Gennaro Tedesco</strong>,</a> Senior Data Scientist at <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io,</a> will present his workflow with DVC and CML. 
<a href="https://www.linkedin.com/in/tezan-sahu/" target="_blank" rel="nofollow noopener noreferrer"><strong>Tezan Sahu</strong></a> will follow, presenting a workflow from a series of his tutorials that we shared in the <a href="https://dvc.org/blog/september-21-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">September Heartbeat,</a> including DVC, PyCaret, MLFlow and FastAPI.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">January Office Hours Meetup - 2 workflows</h4> <div class="elp-description">RSVP for DVC Office Hours - 2 Workflows with integrations including Neovim, PyCaret, MLFlow and FastAPI!</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-12-15/office-hours-meetup-07ea44242950433d0f1062e2bd5ef52f.png" alt="January Office Hours Meetup - 2 workflows"> </div> </a> </section> <p></p> <h3 id="milecia-mc-gregor-at-conf-42" style="position:relative;">Milecia McGregor at Conf42<a href="#milecia-mc-gregor-at-conf-42" aria-label="milecia mc gregor at conf 42 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><span class="gatsby-resp-image-wrapper image-wrap-left" style="position: relative; display: block; ; ; max-width: 375px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e99cd124f97f8c84b053a3c79c40a84e/39600/Conf42.png" 
alt="Conf42" title="Milecia McGregor at Conf42" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Don't miss <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> at the upcoming <a href="https://www.conf42.com/Python_2022_Milecia_McGregor_reproducible_experiments_better_ml_models" target="_blank" rel="nofollow noopener noreferrer">Conf42</a> on January 27th! She will be presenting her talk on Using Reproducible Experiments To Create Better Machine Learning Models. If you haven't caught this talk yet, now's the time!</p> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/December-21-community-gemshttps://dvc.org/blog/December-21-community-gemsTue, 21 Dec 2021 00:00:00 GMT<h3 id="im-using-google-drive-as-a-remote-storage-and-accidentally-entered-the-verification-from-the-wrong-google-account-how-can-i-edit-that" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/908437162150739978" target="_blank" rel="nofollow noopener noreferrer">I'm using Google Drive as a remote storage and accidentally entered the verification from the wrong Google account. 
How can I edit that?</a><a href="#im-using-google-drive-as-a-remote-storage-and-accidentally-entered-the-verification-from-the-wrong-google-account-how-can-i-edit-that" aria-label="im using google drive as a remote storage and accidentally entered the verification from the wrong google account how can i edit that permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>No problem @fireballpoint1! This happens sometimes.</p> <p>You should be able to run the following command in your terminal and then re-enter your credentials.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">rm</span> .dvc/tmp/gdrive-user-credentials.json</span></code></pre></div> <p>That should give you a chance to enter the correct credentials when you try to <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> again.</p> <h3 id="can-i-add-a-dvc-remote-which-refers-to-nas-by-ip-so-i-dont-have-to-mount-on-every-computer" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/912667503283564544" target="_blank" rel="nofollow noopener noreferrer">Can I add a <code>dvc remote</code> which refers to NAS by IP so I don't have to mount on every computer?</a><a href="#can-i-add-a-dvc-remote-which-refers-to-nas-by-ip-so-i-dont-have-to-mount-on-every-computer" aria-label="can i add a dvc remote which refers to nas by ip so i dont have to mount on every computer permalink" class="anchor 
after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>That's a new question for us @Krzysztof Begiedza!</p> <p>If you enable the SSH service on your NAS, you can configure DVC to use it as an SSH remote with <a href="https://dvc.org/doc/command-reference/remote/add"><code>dvc remote add</code></a>.</p> <p>There should also be DSM (Synology DiskStation Manager) packages for webdav as well, if you prefer that over SSH. Just make sure that when you run <a href="https://dvc.org/doc/command-reference/remote/add#-d"><code>dvc remote add -d storage <URL></code></a>, your remote storage URL looks similar to this.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">webdav://<ip>/<path></code></pre></div> <h3 id="can-you-selectively-dvc-pull-data-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/913713923667148850" target="_blank" rel="nofollow noopener noreferrer">Can you selectively <code>dvc pull</code> data files?</a><a href="#can-you-selectively-dvc-pull-data-files" aria-label="can you selectively dvc pull data files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> 
</a></h3> <p>Great question @Clemens!</p> <p>You would run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull <file></code></a> to get the files you want. You could also use the <code>--glob</code> option on <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> and DVC will only pull the relevant files.</p> <p>The command for that pull would be similar to this.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> path/to/specific/file</span></code></pre></div> <p>You could also make a <a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">data registry</a> and use <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> in other projects to get a specific dataset. That way you don't have to do a granular pull.</p> <h3 id="what-is-the-fastest-way-to-get-the-specific-value-of-a-metric-of-an-experiment-based-on-experiment-id" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/916328260856590346" target="_blank" rel="nofollow noopener noreferrer">What is the fastest way to get the specific value of a metric of an experiment based on experiment id?</a><a href="#what-is-the-fastest-way-to-get-the-specific-value-of-a-metric-of-an-experiment-based-on-experiment-id" aria-label="what is the fastest way to get the specific value of a metric of an experiment based on experiment id permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 
3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>That's a really good question @Kwon-Young!</p> <p>You can always look through experiment metrics with <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>, which lists all of the experiments you've run.</p> <p>To get the metrics for a specific experiment or set of experiments, you'll need the experiment IDs; you can then use the Python API, as in this example.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo

dvc <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token string">"."</span><span class="token punctuation">)</span> <span class="token comment"># or Repo("path/to/repo/dir")</span>
metrics <span class="token operator">=</span> dvc<span class="token punctuation">.</span>metrics<span class="token punctuation">.</span>show<span class="token punctuation">(</span>revs<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"exp-name1"</span><span class="token punctuation">,</span> <span class="token string">"exp-name2"</span><span class="token punctuation">,</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div> <p>This returns a Python dictionary that contains what gets displayed in <a href="https://dvc.org/doc/command-reference/metrics/show#--json"><code>dvc metrics show --json</code></a> except you're able to specify the experiments you want.</p> <h3 id="is-it-possible-to-run-the-whole-pipeline-but-only-for-one-element-of-the-foreach" style="position:relative;"><a 
href="https://discord.com/channels/485586884165107732/563406153334128681/915986804577026088" target="_blank" rel="nofollow noopener noreferrer">Is it possible to run the whole pipeline but only for one element of the <code>foreach</code>?</a><a href="#is-it-possible-to-run-the-whole-pipeline-but-only-for-one-element-of-the-foreach" aria-label="is it possible to run the whole pipeline but only for one element of the foreach permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Another great question from @vgodie!</p> <p>If your stages look something like this for example:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">cleanups</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span> <span class="token comment"># List of simple values</span>
      <span class="token punctuation">-</span> raw1
      <span class="token punctuation">-</span> labels1
      <span class="token punctuation">-</span> raw2
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> clean.py "$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>"
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> $<span class="token punctuation">{</span>item<span class="token punctuation">}</span>.cln
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">epochs</span><span class="token punctuation">:</span> <span class="token number">3</span>
        <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span>
      <span class="token punctuation">-</span> <span class="token key atrule">epochs</span><span class="token punctuation">:</span> <span class="token number">10</span>
        <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">15</span>
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py $<span class="token punctuation">{</span>item.epochs<span class="token punctuation">}</span> $<span class="token punctuation">{</span>item.thresh<span class="token punctuation">}</span></code></pre></div> <p>You should try the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> cleanups@labels1</span></code></pre></div> <p>This will run your whole pipeline, but only with <code>labels1</code> in the <code>cleanups</code> stage.</p> <h3 id="is-it-possible-to-pull-experiments-from-the-remote-without-checking-out-the-base-commit-of-those-experiments" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/910481311905505290" target="_blank" rel="nofollow noopener noreferrer">Is it possible to pull experiments from the remote without checking out the base commit of those experiments?</a><a 
href="#is-it-possible-to-pull-experiments-from-the-remote-without-checking-out-the-base-commit-of-those-experiments" aria-label="is it possible to pull experiments from the remote without checking out the base commit of those experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the question @mattlbeck!</p> <p>You should be able to do this with <a href="https://dvc.org/doc/command-reference/exp/pull#-name"><code>dvc exp pull origin exp-name</code></a>.</p> <p>If you have experiments with the same name on different commits, using <code>exp-name</code> won't work since it defaults to selecting the one based on your current commit if there are duplicate names.</p> <p>To work around this, you can use the full refname, like <code>refs/exps/e7/78ad744e8d0cd59ddqc65d5d698cf102533f85/exp-6cb7</code>, to specify the experiments that you want to work with.</p> <h3 id="how-should-i-handle-checkpoints-in-pytorch-lightning-with-dvclive" style="position:relative;"><a href="https://drive.google.com/file/d/1t0wPowk-PUinNjV4xchrzPZh7xsI8i37/view?usp=sharing" target="_blank" rel="nofollow noopener noreferrer">How should I handle checkpoints in PyTorch Lightning with DVCLive?</a><a href="#how-should-i-handle-checkpoints-in-pytorch-lightning-with-dvclive" aria-label="how should i handle checkpoints in pytorch lightning with dvclive permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 
5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a really good question that came from one of our Office Hours talks! Thanks <a href="https://www.linkedin.com/in/sirily/" target="_blank" rel="nofollow noopener noreferrer">Ilia Sirotkin</a>!</p> <p>We have an <a href="https://github.com/iterative/dvclive/issues/170" target="_blank" rel="nofollow noopener noreferrer">open issue</a> we encourage you to follow for more details and to even contribute!</p> <p>PyTorch Lightning handles checkpoints differently from other libraries. This affects the way metrics logging is executed and how models are saved.</p> <p>You can write a custom callback to control saving everything and track it with DVC and this is the workaround we suggest. You can implement the <code>after_save_checkpoint</code> method and save the model file.</p> <p>The way this works is by breaking your training process into small stages. You should specify the stage’s checkpoint as the output of the stage and set it as a dependency for the next stage. 
That way if something breaks, the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command will resume your experiment from the last stage.</p> <p>Your pipeline might look something like this:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">stage_0</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> checkpoints/checkpoint_epoch=0.ckpt
  <span class="token key atrule">next</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">prev</span><span class="token punctuation">:</span> <span class="token number">0</span>
        <span class="token key atrule">next</span><span class="token punctuation">:</span> <span class="token number">1</span>
      <span class="token punctuation">-</span> <span class="token key atrule">prev</span><span class="token punctuation">:</span> <span class="token number">1</span>
        <span class="token key atrule">next</span><span class="token punctuation">:</span> <span class="token number">2</span>
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py <span class="token punctuation">-</span><span class="token punctuation">-</span>checkpoint $<span class="token punctuation">{</span>item.prev<span class="token punctuation">}</span>
      <span class="token key atrule">deps</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> checkpoints/checkpoint_epoch=$<span class="token punctuation">{</span>item.prev<span class="token punctuation">}</span>.ckpt
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> checkpoints/checkpoint_epoch=$<span class="token punctuation">{</span>item.next<span class="token punctuation">}</span>.ckpt</code></pre></div> <p>Then you'll need to reuse the <code>ModelCheckpoint</code> that is included in <code>pytorch_lightning</code> to capture the checkpoints. Here's a snippet of what that could look like in your training script:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> os

<span class="token keyword">import</span> pytorch_lightning <span class="token keyword">as</span> pl

<span class="token comment"># set checkpoint path</span>
ckpt_path <span class="token operator">=</span> os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>abspath<span class="token punctuation">(</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>join<span class="token punctuation">(</span>os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>dirname<span class="token punctuation">(</span>__file__<span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"checkpoints"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># checkpoints will be saved to checkpoints/checkpoint_epoch={epoch_number}.ckpt</span>
cp <span class="token operator">=</span> pl<span class="token punctuation">.</span>callbacks<span class="token punctuation">.</span>model_checkpoint<span class="token punctuation">.</span>ModelCheckpoint<span class="token punctuation">(</span>
    monitor<span class="token operator">=</span><span class="token string">"train_loss_epoch"</span><span class="token punctuation">,</span>
    save_top_k<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">,</span>
    dirpath<span class="token operator">=</span>ckpt_path<span class="token punctuation">,</span>
    filename<span class="token operator">=</span><span class="token string">'checkpoint_{epoch}'</span><span class="token punctuation">)</span></code></pre></div> <h3 id="is-there-a-feature-for-dvc-to-only-sample-and-cache-a-subset-of-the-tracked-dataset-eg-10000-lines-of-a-large-file" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/917778575845900340" target="_blank" rel="nofollow noopener noreferrer">Is there a feature for DVC to only sample and cache a subset of the tracked dataset, e.g. 10000 lines of a large file?</a><a href="#is-there-a-feature-for-dvc-to-only-sample-and-cache-a-subset-of-the-tracked-dataset-eg-10000-lines-of-a-large-file" aria-label="is there a feature for dvc to only sample and cache a subset of the tracked dataset eg 10000 lines of a large file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Really great question @Abdi!</p> <p>You should be able to use the streaming capability of the DVC API to achieve this goal.</p> <p>Here is an example of a Python script that would do this:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>api <span class="token keyword">import</span> <span class="token builtin">open</span> <span class="token keyword">as</span> dvcopen

<span class="token keyword">with</span> dvcopen<span class="token punctuation">(</span><span class="token string">'data'</span><span class="token punctuation">,</span><span class="token string-interpolation"><span class="token string">f'</span><span class="token interpolation"><span class="token punctuation">{</span>repo_url<span class="token punctuation">}</span></span><span class="token string">'</span></span><span class="token punctuation">)</span> <span class="token keyword">as</span> fd<span class="token punctuation">:</span>
    <span class="token keyword">for</span> line <span class="token keyword">in</span> fd<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>line<span class="token punctuation">)</span></code></pre></div> <hr> <p><img src="https://media.giphy.com/media/h5Ct5uxV5RfwY/giphy.gif" alt="Done Tyler The Creator GIF"></p> <p>At our January Office Hours Meetup we will be looking at machine learning workflows and the Neovim-DVC plugin! 
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/december-21-heartbeathttps://dvc.org/blog/december-21-heartbeatWed, 15 Dec 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>We've made it to the end of the year! 2021 has been an amazing journey for us and our growing Community all over the world. There's lots of great news this month. Let's not waste a heartbeat and get right to it! 
😉</p> <p><img src="https://media.giphy.com/media/YAIOuXv2zYDW8/giphy.gif" alt="Heartbeat!"></p> <h2 id="dvc--cml--rasa--️" style="position:relative;">DVC + CML + RASA = ❤️<a href="#dvc--cml--rasa--%EF%B8%8F" aria-label="dvc cml rasa ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/m_a_upson" target="_blank" rel="nofollow noopener noreferrer"><strong>Matthew Upson</strong></a>, Founder at <a href="https://mantisnlp.com/" target="_blank" rel="nofollow noopener noreferrer">MantisNLP,</a> an AI consultancy focused on NLP, along with his team, put out the <a href="https://medium.com/mantisnlp/mlops-for-conversational-ai-with-rasa-dvc-and-cml-part-i-beec756e8e7f" target="_blank" rel="nofollow noopener noreferrer">first blog post</a> in a series showing how to use DVC and CML along with Rasa in developing conversational AI. This post sets the scene for the following more detailed parts, but lays out DVC's use for generating the DAG as well as logging metrics and using CML to do the testing. 
We're looking forward to the next installments!</p> <p><img src="https://media.giphy.com/media/HYrBxW4xsPSP3wsUTk/giphy.gif" alt="Heartbeat!"></p> <h2 id="curious-about-speaker-diarization" style="position:relative;">Curious about Speaker Diarization?<a href="#curious-about-speaker-diarization" aria-label="curious about speaker diarization permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://blogs.cisco.com/developer/speakerdiarization01" target="_blank" rel="nofollow noopener noreferrer">The co-authored article entitled</a> "“Who Said That?” A Technical Intro to Speaker Diarization," by <a href="https://www.linkedin.com/in/dariocazzani/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dario Cazzani</strong></a> and <a href="https://github.com/alhuang10" target="_blank" rel="nofollow noopener noreferrer"><strong>Alex Huang</strong></a>, machine learning engineers at <a href="https://www.cisco.com/" target="_blank" rel="nofollow noopener noreferrer">Cisco,</a> provides an introduction to the topic of Speaker Diarization, or who spoke when, in audio recordings. Their team's solution takes you through the fingerprinting of voices, clustering to assign speaker labels, creating the needed data pipeline, and the integration with Webex.</p> <p>In this process, the team benefits from using DVC to version data and models, as well as to collaborate easily with each other and the transcription team. 
More info on this project can be found <a href="https://github.com/CiscoDevNet/vo-id#train-the-vectorizer" target="_blank" rel="nofollow noopener noreferrer">in their repo.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/eb6417a6abeb72008cb2c97e3cf72fad/39600/Dario-Cazzani-2.png" alt="Speaker Diarization" title="Speaker Diarization" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Dario Cazzani and team's process for assigning speaker labels to audio files (<a href="https://blogs.cisco.com/developer/speakerdiarization01" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="using-dvc-in-academic-research-on-a-compartmental-infectious-disease-model" style="position:relative;">Using DVC in Academic Research on a Compartmental Infectious Disease Model<a href="#using-dvc-in-academic-research-on-a-compartmental-infectious-disease-model" aria-label="using dvc in academic research on a compartmental infectious disease model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/matthew-segal-aa132093/" target="_blank" rel="nofollow noopener noreferrer"><strong>Matthew Segal</strong>,</a> <a href="https://mattsegal.dev/devops-academic-research.html" target="_blank" rel="nofollow noopener noreferrer">in his post,</a> "DevOps in Academic Research," reviews his work of applying some of the tried and 
true practices in DevOps to data science projects using a <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo" target="_blank" rel="nofollow noopener noreferrer">Markov chain Monte Carlo</a> (MCMC) technique to create a model to simulate the spread of tuberculosis and later, as the pandemic erupted, COVID-19.</p> <p>The article covers mapping the workflow (see below), testing the codebase, smoke tests <a href="https://mattsegal.dev/pytest-on-github-actions.html" target="_blank" rel="nofollow noopener noreferrer">(with a guide link),</a> continuous integration, and data management (where he recommends DVC).</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bf57bc814a51e86085b09a83a3717d48/39600/matt-segal.png" alt="Map Pipeline" title="Map Pipeline" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Working to develop a pipeline (<a href="https://mattsegal.dev/devops-academic-research.html" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="are-you-confused-by-how-many-mlops-tools-there-are" style="position:relative;">Are you confused by how many MLOps tools there are?<a href="#are-you-confused-by-how-many-mlops-tools-there-are" aria-label="are you confused by how many mlops tools there are permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><span class="gatsby-resp-image-wrapper image-wrap-right" style="position: 
relative; display: block; ; ; max-width: 450px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f9c0dab3e5e841de11778d0a64e7b89e/39600/thoughtworks-mlops-landscape.png" alt="Thoughtworks Triangle" title="Thoughtworks Platform vs. Specialist Triangle" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Well, <a href="https://www.thoughtworks.com/?utm_source=google-search&utm_medium=paid-media&utm_campaign=always-on-brand_2021-11&utm_term=thoughtworks&utm_content=RSAad1&gclid=Cj0KCQiA2NaNBhDvARIsAEw55hg2li5srltu8ppVsxLzcnv-pYWRmvnCk_jmljiC2ocyM4tc7EUEt9gaAoVWEALw_wcB" target="_blank" rel="nofollow noopener noreferrer">Thoughtworks</a> included DVC in its recent <a href="https://www.thoughtworks.com/what-we-do/data-and-ai/cd4ml/guide-to-evaluating-mlops-platforms" target="_blank" rel="nofollow noopener noreferrer">Thoughtworks Guide to MLOps Platforms</a>. While being included is great, things move so fast that they seem to have missed our experiment capabilities and the CI/CD capabilities for machine learning of CML.🤔</p> <p>And if they only knew what's to come! 
🚀 Lots planned in the new year!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ca7a536e93ac17fbb71c711a1dd1738f/c6e3d/more-tools.png" alt="They don't know DVC has more tools coming" title="They don't know DVC has more tools coming" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Dmitry Petrov's meme (<a href="https://twitter.com/FullStackML/status/1465428233336201218?s=20" target="_blank" rel="nofollow noopener noreferrer">Source Link</a>)</em></p> <h2 id="what-is-mlops---everything-you-must-know-to-get-started" style="position:relative;">What is MLOps - Everything You Must Know to Get Started<a href="#what-is-mlops---everything-you-must-know-to-get-started" aria-label="what is mlops everything you must know to get started permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In his post, <a href="https://towardsdatascience.com/what-is-mlops-everything-you-must-know-to-get-started-523f2d0b8bd8" target="_blank" rel="nofollow noopener noreferrer">What is MLOps - Everything You Must Know to Get Started,</a> <a href="https://www.linkedin.com/in/tyagiharshit/" target="_blank" rel="nofollow noopener noreferrer"><strong>Harshit Tyagi</strong></a> provides an overview of MLOps and why it's necessary for taking today's ML and AI projects to production. You will learn the different parts of the puzzle that make up MLOps, and review the machine learning life cycle. 
In the post, Harshit also provides a video of the concepts as well as an interview with our CEO, <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong>.</a> Be sure to check it out!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/84d1c5a8e04dc183f95961ae2cb797b9/03346/harshit-tyagi.jpg" alt="What is MLOps" title="What is MLOps" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Harshit Tyagi's ML Systems Engineering and Operations with their Stakeholders (<a href="https://towardsdatascience.com/what-is-mlops-everything-you-must-know-to-get-started-523f2d0b8bd8" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="using-antipatterns-to-avoid-mlops-mistakes" style="position:relative;">Using AntiPatterns to avoid MLOps Mistakes<a href="#using-antipatterns-to-avoid-mlops-mistakes" aria-label="using antipatterns to avoid mlops mistakes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/nikhilmuralidhar/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nikhil Muralidhar</strong>,</a> et 
al., in their survey paper, <a href="https://arxiv.org/abs/2107.00079" target="_blank" rel="nofollow noopener noreferrer">Using AntiPatterns to avoid MLOps Mistakes,</a> aim to develop a vocabulary for anti-patterns found in machine learning projects in the financial services industry. In the paper, they also give recommendations for achieving MLOps at an enterprise scale using processes for documentation and management. Luckily, our tools help you to solve some of these challenges!</p> <p>You can also catch Nikhil's interview with <a href="https://twitter.com/bigdata" target="_blank" rel="nofollow noopener noreferrer"><strong>Ben Lorica</strong></a> from <a href="https://thedataexchange.media/" target="_blank" rel="nofollow noopener noreferrer">The Data Exchange</a> <a href="https://thedataexchange.media/mlops-anti-patterns/" target="_blank" rel="nofollow noopener noreferrer">podcast here.</a></p> <p> </p><section class="elp-content-holder"> <a href="https://arxiv.org/abs/2107.00079" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Using AntiPatterns to avoid MLOps Mistakes</h4> <div class="elp-description">Nikhil Muralidhar, et 
paper on AntiPatterns in MLOps in the Financial Services industry and recommendations for improving machine learning operations.</div> <div class="elp-link">https://arxiv.org</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-12-15/arxiv-9d99ec14ee87d2be7259ac0639bf93f9.png" alt="Using AntiPatterns to avoid MLOps Mistakes"> </div> </a> </section> <p></p> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="new-team-member" style="position:relative;">New Team Member<a href="#new-team-member" aria-label="new team member permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/iamritghimire/" target="_blank" rel="nofollow noopener noreferrer"><strong>Amrit Ghimire</strong></a> joins our Studio team as a back end developer, from Pokhara, Nepal. Prior to joining Iterative, he led a team at Leapfrog, Inc. to develop applications for a drug discovery company. Amrit likes to read and watch movies in his free time and aims to finish 3-4 books per month. 
Finally, he enjoys working in Python and Rust, and customizing Linux systems and personal command line automations. Welcome Amrit! 🎉</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As always, we're still hiring! <a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions including:</p> <ul> <li>Senior Software Engineer (ML, Labeling, Python)</li> <li>Senior Frontend Engineer (Typescript, Node, React)</li> <li>Senior Software Engineer (ML, DevTools, Python)</li> <li>Senior Software Engineer (ML, Data Infra, GoLang)</li> <li>Field Data Scientist / Sales Engineer</li> <li>Developer Advocate (Machine Learning)</li> <li>Director / VP of Engineering (ML, DevTools)</li> <li>Director / VP of Product (ML, Data Infra, SaaS)</li> <li>Head of Talent</li> <li>Head of DevRel</li> <li>Account Executive (Sales)</li> </ul> <p>Please pass this info on to anyone you know who may fit the bill. Come join our rocket ship! 
🚀</p> <p><img src="https://media.giphy.com/media/3xz2BzSNxkwPqF8Wdy/giphy.gif" alt="Go Team Nasa GIF"></p> <h2 id="docs-updates" style="position:relative;">Docs Updates<a href="#docs-updates" aria-label="docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The DVC team has been steadily adding to the Experiment Management section of our docs. We want to make sure that all your experiment versioning needs are met and there's more to come! 🚀</p> <p><img src="https://media.giphy.com/media/5qy3GWYwCydByEn3O6/giphy.gif" alt="Dvc GIF"></p> <p>And don't miss <a href="https://dvc.org/doc/use-cases/experiment-tracking" target="_blank" rel="nofollow noopener noreferrer">the latest Use Case on Machine Learning Experiment Tracking,</a> which walks through the progression from traditional, painful note-taking to more advanced methods, and shows how DVC can take you to the next level!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7283f3d9051c553163d7643fb6e936f0/39600/natural-experimentation.png" alt="Machine Learning Experiment Tracking" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Tired of this? Check out our docs! 
(<a href="https://dvc.org/doc/use-cases/experiment-tracking" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="dvc-online-course-update" style="position:relative;">DVC Online Course Update!<a href="#dvc-online-course-update" aria-label="dvc online course update permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The course is in editing mode and this week we are getting the second cuts for review. The first course will focus on DVC for Data Scientists and Analysts. The course is on track to be out by the end of the year! It will be 100% <strong>FREE</strong> and available from our websites. We are so excited about how it's coming to life! 🚀</p> <p>👀 Note that the Udemy channel in Discord has now changed to #iterative-online-course. 
We're getting ready!</p> <p><img src="https://media.giphy.com/media/xUOxfh6ZM75efM3Bqo/giphy.gif" alt="You Can Do It GIF by chuber channel"></p> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Be sure to join us at the <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" target="_blank" rel="nofollow noopener noreferrer">January Office Hours Meetup,</a> where <a href="https://www.linkedin.com/in/gennarotedesco/" target="_blank" rel="nofollow noopener noreferrer"><strong>Gennaro Tedesco</strong>,</a> Senior Data Scientist at <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io,</a> will present his workflow with DVC and CML, and his Neovim-DVC plugin. 
<a href="https://www.linkedin.com/in/tezan-sahu/" target="_blank" rel="nofollow noopener noreferrer"><strong>Tezan Sahu</strong>,</a> will follow presenting a workflow from a series of tutorials that we shared from him in the <a href="https://dvc.org/blog/september-21-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">September Heartbeat,</a> including DVC, PyCaret, MLFlow and FastAPI.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282663146/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">January Office Hours Meetup - 2 workflows</h4> <div class="elp-description">RSVP for DVC Office Hours - 2 Workflows with integrations including Neovim, PyCaret, MLFlow and FastAPI!</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-12-15/office-hours-meetup-07ea44242950433d0f1062e2bd5ef52f.png" alt="January Office Hours Meetup - 2 workflows"> </div> </a> </section> <p></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There were many candidates this month. Check out our Testimonials Wall of Love, which is now live on our <a href="https://dvc.org/community" target="_blank" rel="nofollow noopener noreferrer">Community Page</a> and holds many of our favorite Tweets! 
If you'd like to give a shout-out to our tools <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">head here</a> to make a video or written testimonial. We'd appreciate it! 🙏🏼</p> <p>But for this month, this Tweet wins the coveted Tweet Love slot.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Playing with Data Science Version control (DVC) from <a href="https://twitter.com/Iterativeai">@Iterativeai</a> - amazing how much it has progressed since I looked at it a couple of years ago</p>— Chris Samiullah (@ChrisSamiullah) <a href="https://twitter.com/ChrisSamiullah/status/1461702483965886468">November 19, 2021</a></blockquote> <h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>And with that we close out the year! We send a huge thank you to all of our Community members who help us make our tools better. Thank you for your contributions, trust and feedback! We look forward to continuing to grow with you in 2022! Have a wonderful holiday season and Happy New Year! 🎉</p> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! 
Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/collaborative-experimentshttps://dvc.org/blog/collaborative-experimentsMon, 13 Dec 2021 00:00:00 GMT<h2 id="intro" style="position:relative;">Intro<a href="#intro" aria-label="intro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Sharing experiments to compare machine learning models is important when you're working with a team of engineers. You might need to get another opinion on an experiment's results. You might need to share a modified dataset or even share the exact reproduction of a specific experiment.</p> <p>Setting up DVC remotes in addition to your Git remotes lets you share all of the data, code, and hyperparameters associated with each experiment so anyone can pick up where you left off in the training process. 
We'll go through an example of sharing an experiment with DVC remotes.</p> <h2 id="forking-the-project" style="position:relative;">Forking the project<a href="#forking-the-project" aria-label="forking the project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To follow along, fork <a href="https://github.com/iterative/example-dvc-experiments" target="_blank" rel="nofollow noopener noreferrer">this repo</a> as one of your own GitHub repos. That way you'll have push access when we start working with DVC. This repo has different tags that show the progression of the project and you're welcome to check them out!</p> <p>To get the branch we'll use in this post, you can run this command to clone your forked repo. 
Make sure to replace <code><your_github></code> with your GitHub name.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> git@github.com:<span class="token operator"><</span>your_github<span class="token operator">></span>/example-dvc-experiments.git <span class="token parameter variable">-b</span> get-started</span></code></pre></div> <p>This project already has DVC files set up to run experiments, but if you want to follow along with a project you're currently working on, make sure to check out the steps to initialize a DVC pipeline in <a href="https://dvc.org/doc/start" target="_blank" rel="nofollow noopener noreferrer">the Getting Started doc</a>.</p> <h2 id="setting-up-your-dvc-remotes" style="position:relative;">Setting up your DVC remotes<a href="#setting-up-your-dvc-remotes" aria-label="setting up your dvc remotes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When you want to share the progress you've made with training your model, that usually means you need to find a way to bundle the code, data, and hyperparameters. This could be a complicated process if you're working with GBs worth of data or you have a large number of hyperparameters.</p> <p>That's one of the uses for DVC and why we'll be working with remotes. To start with, make sure your GitHub remote is configured correctly. It should use the SSH version of the URL. 
This is so DVC can authenticate the pushes and pulls from GitHub it needs as part of experiment sharing.</p> <p>The way DVC works is by storing custom Git refs in your repo with metadata that defines the experiment. You can learn more about how DVC uses custom Git refs in <a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">this post</a>.</p> <p>Next, you'll need to set up a remote to your data location. This could be an AWS S3 bucket, a Google Drive, or <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">one of the other supported storage types</a>.</p> <p>An important thing to note about the project we're working with is that there is already a remote set up for you to pull from. You can see this in <code>.dvc/config</code>. You'll need to set up a separate remote to push changes to since this remote doesn't allow push access.</p> <p>For this example, we'll be using a Google Drive folder as the remote to handle pushing data. Now that you know what we're doing, let's run the command to set up the DVC remote to push to.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> cloud_remote gdrive://1k6aUYWphOulJlXgq4XbfKExWGyymTpEl</span></code></pre></div> <p>This adds the remote storage named <code>cloud_remote</code> for DVC to track and we'll be able to push and pull the exact code and data to reproduce any experiment. 
With your Git remote and DVC remotes in place, you can start pulling data and experiments from the cloud to your local machine.</p> <p><em>Note: Make sure you have write permissions to the Git remote!</em></p> <h2 id="listing-your-remote-experiments" style="position:relative;">Listing your remote experiments<a href="#listing-your-remote-experiments" aria-label="listing your remote experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When you're working with a team on an existing project, you might want to see the experiments already in the remotes so you know what's available. To take a look at the experiments we have run in the repo you forked, you'll have to set up a new Git upstream remote to reference the original repo. 
You can do that with the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git remote add</span> upstream https://github.com/iterative/example-dvc-experiments</span></code></pre></div> <p>Now you can take a look at all of the experiments we have associated with this repo with the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp list</span> upstream <span class="token parameter variable">--all</span></span></code></pre></div> <p>You'll get a list of all of the experiments across different Git branches that have been pushed with DVC in the original repo. The output will look similar to this.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">21784fa:
        exp-c8dcf
main:
        exp-b3667
        exp-d382a</code></pre></div> <p>Now you'll be able to pick which experiment you want to reproduce and start testing with.</p> <h2 id="pulling-experiments" style="position:relative;">Pulling experiments<a href="#pulling-experiments" aria-label="pulling experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you're picking up an existing project, there will likely be a specific experiment you'll get started with. 
To pull an experiment to your local machine, you'll need an experiment id for the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp pull</span> upstream exp-b3667</span></code></pre></div> <p>The <code>exp-b3667</code> comes from the <a href="https://dvc.org/doc/command-reference/exp/list"><code>dvc exp list</code></a> command we ran earlier and now you have all of the data and code associated with that experiment on your machine.</p> <p>From here, you can start running new experiments with different models, hyperparameters, or even datasets.</p> <h2 id="pushing-experiments" style="position:relative;">Pushing experiments<a href="#pushing-experiments" aria-label="pushing experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Once you're done with your new experiments, you can push these to the Google Drive remote we set up earlier. DVC handles both the GitHub and data storage pushes with this command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp push</span> origin exp-p4202</span></code></pre></div> <p>This will push the custom Git refs to your forked repo and it will push any artifacts, like your data or model output, to the DVC remote location. If you have checkpoints enabled, it will also push the checkpoints of an experiment. 
Now you can easily share your work with other engineers to get feedback faster and finish projects sooner.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>It's a lot easier to get help from someone on a project when you can share everything with them. When you use DVC, you can bundle your data and code changes for each experiment and push those to a remote for somebody else to check out.</p>https://dvc.org/blog/ml-experiment-versioninghttps://dvc.org/blog/ml-experiment-versioningTue, 07 Dec 2021 00:00:00 GMT<div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/z0s42TxH9oM?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>Experiment tracking tools help manage machine learning projects where version control tools like Git aren't enough. They log parameters and metrics, and they store artifacts like input data or model weights, so that you can reproduce experiments and retrieve results. 
They also provide a dashboard to navigate all this meta-information across lots of experiments.</p> <p>Git can't manage or compare all that experiment meta-information, but it is still better for code. Tools like GitHub make distributed collaboration easy, and you can see incremental code changes. That's why experiments get split between Git for code and experiment tracking tools for meta-information (usually with a link in one or the other to keep track).</p> <p>ML experiment versioning combines experiment tracking and version control. Instead of managing these separately, keep everything in one place and get the benefits of both, like:</p> <ul> <li><strong>Experiments as code</strong>: Track meta-information in the repository and version it like code.</li> <li><strong>Versioned reproducibility</strong>: Save and restore experiment state, and track changes to only execute what's new.</li> <li><strong>Distributed experiments</strong>: Organize locally and choose what to share, reusing your existing repo setup.</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 537px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4f65fcb9a5c8d32158b5283122c9dd10/39600/exp-versioning.png" alt="Experiment Versioning" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h1 id="ml-experiments-as-code" style="position:relative;">ML Experiments as Code<a href="#ml-experiments-as-code" aria-label="ml experiments as code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 
3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Experiment versioning treats experiments as code. It saves all metrics, hyperparameters, and artifact information in text files that can be versioned by Git (DVC <a href="https://dvc.org/doc/start/data-and-model-versioning" target="_blank" rel="nofollow noopener noreferrer">data versioning</a> backs up the artifacts themselves anywhere). You do not need a centralized database or online services. Git becomes a store for experiment meta-information.</p> <p>You can choose your own file formats and paths, which you can configure in DVC:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp init</span> <span class="token parameter variable">-i</span> </span>This command will guide you to set up a default stage in dvc.yaml. See https://dvc.org/doc/user-guide/project-structure/pipelines-files. DVC assumes the following workspace structure: ├── data ├── metrics.json ├── models ├── params.yaml ├── plots └── src Command to execute: python src/train.py Path to a code file/directory [src, n to omit]: src/train.py Path to a data file/directory [data, n to omit]: data/images/ Path to a model file/directory [models, n to omit]: Path to a parameters file [params.yaml, n to omit]: Path to a metrics file [metrics.json, n to omit]: Path to a plots file/directory [plots, n to omit]: logs.csv</code></pre></div> <p>Once you set up your repo in this structure, you start to see the benefits of this approach. Experiment meta-information lives in readable files that are always available, and your code can stay clean. 
You can read, save, and version your meta-information:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> params.yaml </span>train: epochs: 10 model: conv_units: 128</code></pre></div> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> metrics.json </span>{"loss": 0.24310708045959473, "acc": 0.9182999730110168}</code></pre></div> <p>You can see what changed in parameters, code, or anything else:</p> <div class="gatsby-highlight" data-language="diff"><pre class="language-diff"><code class="language-diff">$ git diff HEAD~1 -- params.yaml diff --git a/params.yaml b/params.yaml index baad571a2..57d098495 100644 <span class="token coord">--- a/params.yaml</span> <span class="token coord">+++ b/params.yaml</span> <span class="token coord">@@ -1,5 +1,5 @@</span> <span class="token unchanged"><span class="token prefix unchanged"> </span>train: <span class="token prefix unchanged"> </span> epochs: 10 </span><span class="token deleted-sign deleted"><span class="token prefix deleted">-</span>model: <span class="token prefix deleted">-</span> conv_units: 16 </span><span class="token inserted-sign inserted"><span class="token prefix inserted">+</span>model: <span class="token prefix inserted">+</span> conv_units: 128</span></code></pre></div> <p>With DVC, you can even compare lots of experiments from the terminal like you would in a dashboard:</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"><span class="token rows">$ dvc exp show </span> ───────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token 
hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.epochs<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model.conv_units<span class="token hide">**</span></span></span> </span> ───────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.25183<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.9137<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>10<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>mybranch<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>Oct 23, 
2021<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>10<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>16<span class="token hide">**</span></span> ├── 9a4ff1c <span class="token bold"><span class="token hide">**</span>[exp-333c9]<span class="token hide">**</span></span> 10:40 AM 0.25183 0.9137 10 64 ├── 138e6ea <span class="token bold"><span class="token hide">**</span>[exp-55e90]<span class="token hide">**</span></span> 10:28 AM 0.25784 0.9084 10 32 └── 51b0324 <span class="token bold"><span class="token hide">**</span>[exp-2b728]<span class="token hide">**</span></span> 10:17 AM 0.25829 0.9058 10 16 </span> ─────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <h1 id="versioned-reproducibility" style="position:relative;">Versioned reproducibility<a href="#versioned-reproducibility" aria-label="versioned reproducibility permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>One reason you need to track all this meta-information is to reproduce your experiment. Experiment tracking databases save the artifacts, but you still need to put them all back in the right place. 
Since experiment versioning keeps all the meta-information in your repo, you can restore the experiment state exactly as it was in your workspace. DVC <a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">saves the state of the experiment</a>, and it can restore it for you:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp apply</span> exp-333c9 </span> Changes for experiment 'exp-333c9' have been applied to your current workspace.</code></pre></div> <p>Reproducibility is nice, but data drift, new business requirements, bug fixes, etc. all mean running a slightly modified experiment. You don't have time to always start from scratch. Versioned reproducibility means tracking changes to the experiment state. DVC can determine what changes were introduced by the experiment and only run what's necessary. It only saves those changes, so you don't waste time or storage on duplicate copies of data.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">model.conv_units</span><span class="token operator">=</span><span class="token number">128</span> </span>'data/images.tar.gz.dvc' didn't change, skipping Stage 'extract' didn't change, skipping Running stage 'train': > python3 src/train.py 79/79 [==============================] - 1s 14ms/step - loss: 0.2552 - acc: 0.9180 Updating lock file 'dvc.lock' Reproduced experiment(s): exp-be916 Experiment results have been applied to your workspace. 
To promote an experiment to a Git branch run: dvc exp branch <exp> <branch></code></pre></div> <h1 id="distributed-experiments" style="position:relative;">Distributed Experiments<a href="#distributed-experiments" aria-label="distributed experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Experiment tracking tools log experiments to a central database and show them in a dashboard. This makes it easy to share them with teammates and compare experiments. However, it introduces a problem - in an active experimentation phase, you may create hundreds of experiments. Team members may be overwhelmed, and the tool loses one of its core purposes - sharing experiments between team members.</p> <p>Experiment versioning piggybacks on Git and its distributed nature. All the experiments you run are stored in your local repo, and only the best experiments are promoted to the central repo (GitHub for example) to share with teammates. 
Distributed experiments are shared with the same people as your code repo, so you don't need to replicate your project permissions or worry about security risks.</p> <p>With DVC, you can push experiments just like Git branches, giving you flexibility to share whatever, whenever, and wherever you choose:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp push</span> origin exp-333c9 </span>Pushed experiment 'exp-333c9' to Git remote 'origin'.</code></pre></div> <h1 id="what-next" style="position:relative;">What Next?<a href="#what-next" aria-label="what next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>These enhancements can have powerful ripple effects for fast-moving, complex, collaborative ML projects. There are parallels to the <a href="https://ericsink.com/vcbe/html/history_of_version_control.html" target="_blank" rel="nofollow noopener noreferrer">history of version control</a>. Git's distributed nature and incremental change tracking were major advances over the centralized, file-based version control systems of previous generations. Experiment versioning can similarly advance the next generation of experiment tracking.</p> <p>ML experiment versioning is still in its early days. 
Look out for future announcements about:</p> <ul> <li>Deep learning features like <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">live monitoring</a> and <a href="https://dvc.org/doc/user-guide/experiment-management/checkpoints" target="_blank" rel="nofollow noopener noreferrer">checkpointing</a>.</li> <li>Visualizing and comparing experiment results in other tools like VS Code and <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">DVC Studio</a>.</li> </ul> <p>What do you want to see for the next generation of experiment tracking? Join our upcoming <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282064369/" target="_blank" rel="nofollow noopener noreferrer">meetup</a> to discuss, join our <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord community</a>, or let us know in the comments!</p>https://dvc.org/blog/november-21-community-gemshttps://dvc.org/blog/november-21-community-gemsTue, 30 Nov 2021 00:00:00 GMT<h3 id="what-would-be-the-cleanest-most-pythonic-way-to-run-dvc-commands-from-inside-a-python-script-if-we-want-to-avoid-calling-the-subprocess-library" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/895570704605528094" target="_blank" rel="nofollow noopener noreferrer">What would be the cleanest, most Pythonic way to run DVC commands from inside a Python script if we want to avoid calling the subprocess library?</a><a href="#what-would-be-the-cleanest-most-pythonic-way-to-run-dvc-commands-from-inside-a-python-script-if-we-want-to-avoid-calling-the-subprocess-library" aria-label="what would be the cleanest most pythonic way to run dvc commands from inside a python script if we want to avoid calling the subprocess library permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 
0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>That's a really good question @mihaj!</p> <p>If you want to run DVC commands in a Python script, you have a couple of options.</p> <p>You can work with the <code>main</code> module from the <code>dvc</code> library. This is the more CLI-like option. An example of running an experiment would look something like this.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>main <span class="token keyword">import</span> main main<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"exp"</span><span class="token punctuation">,</span> <span class="token string">"run"</span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div> <p>The other option you have is to use the <code>Repo API</code>. This API is largely undocumented at the moment, but it closely mirrors the CLI commands. 
One exception is that they will return internal data structures instead of exit codes.</p> <p>Here's an example of running an experiment with the Repo API.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo repo <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token punctuation">)</span> repo<span class="token punctuation">.</span>experiments<span class="token punctuation">.</span>run<span class="token punctuation">(</span><span class="token punctuation">)</span> repo<span class="token punctuation">.</span>experiments<span class="token punctuation">.</span>show<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># etc...</span></code></pre></div> <h3 id="how-can-you-check-if-a-dvc-tracked-directory-has-changes" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/899693929560158218" target="_blank" rel="nofollow noopener noreferrer">How can you check if a DVC tracked directory has changes?</a><a href="#how-can-you-check-if-a-dvc-tracked-directory-has-changes" aria-label="how can you check if a dvc tracked directory has changes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Good question from @edran!</p> <p>You can check which directories have been changed by running:</p> <div class="gatsby-highlight" 
data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc status</span></span></code></pre></div> <p>This will give you an output similar to this in your terminal:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">changed deps</span><span class="token punctuation">:</span> <span class="token key atrule">modified</span><span class="token punctuation">:</span> src/train.py <span class="token key atrule">changed outs</span><span class="token punctuation">:</span> <span class="token key atrule">deleted</span><span class="token punctuation">:</span> model.pkl <span class="token key atrule">evaluate</span><span class="token punctuation">:</span> <span class="token key atrule">changed deps</span><span class="token punctuation">:</span> <span class="token key atrule">deleted</span><span class="token punctuation">:</span> model.pkl</code></pre></div> <p>We're working on adding granularity support for this command and should have a release for this in the next few months.</p> <h3 id="is-there-a-way-to-look-at-all-of-the-experiments-ive-run-and-see-the-metrics-and-parameters-associated-with-them" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/900451895666155520" target="_blank" rel="nofollow noopener noreferrer">Is there a way to look at all of the experiments I've run and see the metrics and parameters associated with them?</a><a href="#is-there-a-way-to-look-at-all-of-the-experiments-ive-run-and-see-the-metrics-and-parameters-associated-with-them" aria-label="is there a way to look at all of the experiments ive run and see the metrics and parameters associated with them permalink" class="anchor after"><svg aria-hidden="true" height="16" 
width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for asking @GuyAR! This is a common question that comes up.</p> <p>You can see all of your experiments and the associated metrics and parameters in a table in the terminal by running the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span></span></code></pre></div> <p>This will give you a table that looks similar to this with all of this information.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span 
class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> </span> ──────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.91389<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.87<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.20506<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.66306<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>data-change<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token 
hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> │ ╓ 9405575 [exp-54e8a] 3 0.91389 0.87 0.20506 0.66306 0.001 0.09 │ ╟ 856d80f 2 0.90215 0.87333 0.27204 0.61631 0.001 0.09 │ ╟ 23dc98f 1 0.87671 0.86 0.35964 0.61713 0.001 0.09 ├─╨ 99a3c34 0 0.71429 0.82 0.67674 0.62798 0.001 0.09 │ ╓ 3b3a2a2 [exp-23593] 3 0.86885 0.46 0.31573 3.7067 0.001 0.09 │ ╟ 93d015d 2 0.83197 0.41333 0.36851 3.4259 0.001 0.09 │ ╟ d474c42 1 0.79918 0.43333 0.46612 3.286 0.001 0.09 ├─╨ 1582b4b 0 0.52869 0.39 0.94102 2.5967 0.001 0.09 </span> ────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <h3 id="whats-the-recommended-way-to-remove-data-that-has-been-imported-using-dvc-import" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/898462029650735134" target="_blank" rel="nofollow noopener noreferrer">What's the recommended way to remove data that has been imported using <code>dvc import</code>?</a><a href="#whats-the-recommended-way-to-remove-data-that-has-been-imported-using-dvc-import" aria-label="whats the recommended way to remove data that has been imported using dvc import permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 
13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Great question @MadsO!</p> <p>This works exactly the same way as for data you've added with <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>. To remove the data, run this command, passing the corresponding <code>.dvc</code> file as the target:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remove</span></span></code></pre></div> <h3 id="when-using-a-cml-are-github-actions-gitlab-and-bitbucket-the-only-options-for-ci" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/909847110306914345" target="_blank" rel="nofollow noopener noreferrer">When using a CML, are GitHub Actions, GitLab, and BitBucket the only options for CI?</a><a href="#when-using-a-cml-are-github-actions-gitlab-and-bitbucket-the-only-options-for-ci" aria-label="when using a cml are github actions gitlab and bitbucket the only options for ci permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Currently, <code>cml runner</code> does not support CircleCI or Drone CI self-hosted runners, so you would have to deploy them manually.</p> <p>You can still use <code>cml send-comment</code>, <code>cml pr</code>, and the other CML tools with any CI platform.</p> <p>Thanks for this awesome question @tpietruszka!</p> <h3 id="when-i-run-the-dvc-remove-command-does-it-only-remove-dvc-files" style="position:relative;"><a 
href="https://discord.com/channels/485586884165107732/563406153334128681/905382438786715648" target="_blank" rel="nofollow noopener noreferrer">When I run the <code>dvc remove</code> command, does it only remove <code>.dvc</code> files?</a><a href="#when-i-run-the-dvc-remove-command-does-it-only-remove-dvc-files" aria-label="when i run the dvc remove command does it only remove dvc files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>A really good question from @flowy!</p> <p>That is correct. By default, running <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a> stops DVC from tracking the target but leaves the tracked files and directories in your workspace. It also removes the corresponding entry from <code>.gitignore</code> and, for pipeline stages, removes the stage from <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p> <p>For example, if you run something like <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove folder_name/file.dvc</code></a>, only the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file will be removed. Your updated directory would still have <code>folder_name/file</code>, since that was the file being tracked.</p> <p>If you want to remove the tracked file as well, you need to run <a href="https://dvc.org/doc/command-reference/remove#--outs"><code>dvc remove --outs</code></a>. This option also removes the outputs of the target.</p> <p>If there is nothing else in the folder, you'll be left with an empty directory. 
You can remove it and stop tracking in Git with a command like:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> <span class="token function">rm</span> <span class="token parameter variable">-r</span> folder_name</span></code></pre></div> <h3 id="can-dvc-studio-be-connected-to-a-self-managed-gitlab-repo" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/841856466897469441/907468264882462800" target="_blank" rel="nofollow noopener noreferrer">Can DVC Studio be connected to a self-managed GitLab repo?</a><a href="#can-dvc-studio-be-connected-to-a-self-managed-gitlab-repo" aria-label="can dvc studio be connected to a self managed gitlab repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Very good question about Studio @Sra!</p> <p>Right now this only works if it's an on-premises network or a private VPC network.</p> <p>We are working on bringing custom-domain GitLab as a feature very soon! 
You can follow <a href="https://github.com/iterative/studio-support/issues/12" target="_blank" rel="nofollow noopener noreferrer">this GitHub issue</a> and leave comments for anything you'd like to see!</p> <h3 id="is-there-a-way-to-extend-default-job-execution-time-for-a-cml-runner" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/904660123161600021" target="_blank" rel="nofollow noopener noreferrer">Is there a way to extend default job execution time for a CML runner?</a><a href="#is-there-a-way-to-extend-default-job-execution-time-for-a-cml-runner" aria-label="is there a way to extend default job execution time for a cml runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There is definitely a way to do this!</p> <p>You can extend the max time in your CI by adding something like this:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">timeout-minutes</span><span class="token punctuation">:</span> <span class="token number">5000</span></code></pre></div> <p>If you're using GitLab, the same update would look similar to this:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">timeout</span><span class="token punctuation">:</span> 72 
hours</code></pre></div> <p>Thanks for this question @evergreengt!</p> <hr> <p><img src="https://media.giphy.com/media/VInc9GYelUbHf5QhNR/giphy.gif" alt="Matt Fraser GIF by E!"></p> <p>At our December Office Hours Meetup we will be doing a new feature demo you won't want to miss! <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282064369/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/november-21-heartbeathttps://dvc.org/blog/november-21-heartbeatWed, 17 Nov 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>I can't believe it's already November! 
Our Community has given us a lot to be thankful for!</p> <p><img src="https://media.giphy.com/media/vLxTOSEfHIr0A/giphy.gif" alt="Hello November!"></p> <h2 id="thanakorn-panyapiangs-two-part-tutorial-data-versioning-with-dvc" style="position:relative;">Thanakorn Panyapiang's Two-Part Tutorial: Data Versioning with DVC<a href="#thanakorn-panyapiangs-two-part-tutorial-data-versioning-with-dvc" aria-label="thanakorn panyapiangs two part tutorial data versioning with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In his two-part tutorial, which can be found <a href="https://medium.com/@thanakornpanyapiang/data-versioning-why-do-data-science-projects-need-it-a44cb7a471c9" target="_blank" rel="nofollow noopener noreferrer">here</a> and <a href="https://medium.com/@thanakornpanyapiang/data-versioning-with-dvc-a474af1247f5" target="_blank" rel="nofollow noopener noreferrer">here,</a> <a href="https://www.linkedin.com/in/tpanyapiang/" target="_blank" rel="nofollow noopener noreferrer"><strong>Thanakorn Panyapiang</strong></a> first explains why data versioning is so important to successful machine learning projects. Next, he takes us through a DVC tutorial showing how to install and initialize DVC. Finally, he covers tracking, pushing to remote storage, modifying, and switching the data. 
In the future, look out for more posts from Thanakorn on the other features of DVC, including pipelines, metrics, experiments, and continuous integration through CML!</p> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/@thanakornpanyapiang/data-versioning-with-dvc-a474af1247f5/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data Versioning with DVC</h4> <div class="elp-description">Thanakorn Panyapiang's explanation of the importance of data version control in ML projects and tutorial on DVC.</div> <div class="elp-link">https://medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-11-17/panyapiang-6809a26f4e50374dbdb69b06c83d1cf5.jpeg" alt="Data Versioning with DVC"> </div> </a> </section> <p></p> <h2 id="sanaka-chathuranga-end-to-end-machine-learning-pipeline-with-mlops-tools" style="position:relative;">Shanaka Chathuranga: End to End Machine Learning Pipeline with MLOps tools<a href="#sanaka-chathuranga-end-to-end-machine-learning-pipeline-with-mlops-tools" aria-label="sanaka chathuranga end to end machine learning pipeline with mlops tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/shanakac/" target="_blank" rel="nofollow noopener noreferrer"><strong>Shanaka Chathuranga</strong></a> uses multiple tools, including DVC, to build an end-to-end Machine Learning Pipeline. 
In the mix you'll find Cookiecutter, DVC, Mlflow, GitHub Actions, Heroku, Flask, Evidently AI, and PyTest in <a href="https://medium.com/@shanakachathuranga/end-to-end-machine-learning-pipeline-with-mlops-tools-mlflow-dvc-flask-heroku-evidentlyai-github-c38b5233778c" target="_blank" rel="nofollow noopener noreferrer">his post</a> in <a href="https://medium.com/" target="_blank" rel="nofollow noopener noreferrer">Medium.</a> DVC is used for data versioning and model pipeline management in this tutorial.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/799a0ab79de5777ed1465c8eb8404a2e/39600/shanaka.png" alt="End to End Machine Learning Pipeline" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Shanaka Chathuranga's End to End ML Pipeline Tools Stack (<a href="https://medium.com/@shanakachathuranga/end-to-end-machine-learning-pipeline-with-mlops-tools-mlflow-dvc-flask-heroku-evidentlyai-github-c38b5233778c" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <p>📣 Swag to the first person to do a similar tutorial using DVC for experiment tracking and versioning and CML for CI/CD. 
🚦Go!👉🏽</p> <h2 id="covid-genomics-apache-airflow-and-dvc-integration" style="position:relative;">COVID Genomics Apache Airflow and DVC Integration<a href="#covid-genomics-apache-airflow-and-dvc-integration" aria-label="covid genomics apache airflow and dvc integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://covidgenomics.com/blog/airflow_dvc/" target="_blank" rel="nofollow noopener noreferrer">In this blog post,</a> <a href="https://www.linkedin.com/in/piotrstyczynski/" target="_blank" rel="nofollow noopener noreferrer"><strong>Piotr Styczyński</strong></a> of <a href="https://covidgenomics.com/" target="_blank" rel="nofollow noopener noreferrer">COVID Genomics</a> shares how they use Airflow and DVC together in their work to model SARS-CoV-2 and optimize RT-PCR tests. 
They needed to update the data used to train the model daily and automate their processes to make sure the whole pipeline stays up to date.</p> <p>Be sure to check out the very detailed tutorial with lots of delicious code and two repositories <a href="https://github.com/covid-genomics/airflow-dvc" target="_blank" rel="nofollow noopener noreferrer">here</a> and <a href="https://github.com/covid-genomics/dvc-fs" target="_blank" rel="nofollow noopener noreferrer">here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 570px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/32dbfd16f40c957d08f13e231577f71a/39600/covid-genomics.png" alt="Airflow + DVC Integration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Piotr Styczyński's blog on COVID Genomics' use of Airflow with DVC (<a href="https://covidgenomics.com/blog/airflow_dvc/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="looking-to-create-a-light-weight-feature-store" style="position:relative;">Looking to create a lightweight Feature Store?<a href="#looking-to-create-a-light-weight-feature-store" aria-label="looking to create a light weight feature store permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Remember <a href="https://twitter.com/jcpsantiago" target="_blank" rel="nofollow noopener noreferrer"><strong>João Santiago</strong></a> from <a href="https://github.com/jcpsantiago/dvthis" target="_blank" 
rel="nofollow noopener noreferrer">dvthis?</a> Well he's back at it solving ML engineering challenges, sharing his new blog post, <a href="https://medium.com/billie-finanzratgeber/unlocking-our-data-with-a-feature-store-402ade0743b" target="_blank" rel="nofollow noopener noreferrer">Unlocking Our Data with a Feature Store.</a> In this article from the <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io</a> engineering crew, Santiago shows how they implemented a light weight feature store creating a system in which features are defined in YAML files (gotta love those YAML files 😉) interfacing with Snowflake. Check out how they did it, and learn the term "instarejected" which he coined and we all should instaadopt!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 509px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bce00f72f761efe5c1b6924c7e398c42/39600/billie.png" alt="Billie.io Lightweight Feature Store" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Billie.io's feature store: Snowflake + Lambda + Redis (<a href="https://medium.com/billie-finanzratgeber/unlocking-our-data-with-a-feature-store-402ade0743b" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h1 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z"></path></svg> </a></h1> <h2 id="learn-about-dvc-en-español" style="position:relative;">Learn about DVC en Español!<a href="#learn-about-dvc-en-espa%C3%B1ol" aria-label="learn about dvc en español permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://tryolabs.com/" target="_blank" rel="nofollow noopener noreferrer">TryoLabs</a> recently held an Open Meetup in Uruguay to teach about some of the technology they use at the consultancy. <a href="https://www.linkedin.com/in/ianspektor/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ian Spektor</strong>,</a> <a href="https://www.linkedin.com/in/diego-kiedanski/" target="_blank" rel="nofollow noopener noreferrer"><strong>Diego Kiedanski</strong>,</a> and <a href="https://www.linkedin.com/in/nicol%C3%A1s-eiris-64916194/" target="_blank" rel="nofollow noopener noreferrer"><strong>Nicolás Eiris</strong></a> presented what they have learned and how they use DVC to better organize the data for the various projects they work on with their clients. 
In addition to streamlining the onboarding of the data for their projects, DVC has provided them reproducibility of the various data and code versions in their workflows.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/4uEjIa-f_FE?rel=0&%3B=&%3Bshowinfo=0%3B&start=268" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>Also en Español, our own <a href="https://twitter.com/daviddelachurch" target="_blank" rel="nofollow noopener noreferrer"><strong>David de la Iglesia Castro</strong></a> will be presenting at <a href="https://pybcn.org/events/pyday_bcn/pyday_bcn_2021/" target="_blank" rel="nofollow noopener noreferrer">Python Barcelona</a> on "Making MLOps Uncool Again." 
In this workshop David will show you how to use HuggingFace, DVC and CML to create an MLOps workflow, extending the power of Git and GitHub without the need for external platforms or complicated infrastructure.</p> <p> </p><section class="elp-content-holder"> <a href="https://pybcn.org/events/pyday_bcn/pyday_bcn_2021/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Python Barcelona</h4> <div class="elp-description">Join David de la Iglesia Castro for his workshop entitled Making MLOps Uncool Again.</div> <div class="elp-link">https://pybcn.org</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-11-17/py-barcelona-8da4ea0fb572ed2e8ee49c791bc3d1b7.png" alt="Python Barcelona"> </div> </a> </section> <p></p> <h2 id="october-office-hours-video-continuum-industries-tool-stack-with-ivan-chan" style="position:relative;">October Office Hours Video: Continuum Industries Tool Stack with Ivan Chan<a href="#october-office-hours-video-continuum-industries-tool-stack-with-ivan-chan" aria-label="october office hours video continuum industries tool stack with ivan chan permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you missed last month's Office Hours <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>, you can now catch the video! 
<a href="https://www.linkedin.com/in/ivanchc/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ivan Chan</strong></a> took us on a journey through the <a href="https://www.continuum.industries/" target="_blank" rel="nofollow noopener noreferrer">Continuum Industries</a> tool stack and showed us how they save tons of time weekly by integrating DVC and CML into their workflows.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/TBZKfyYWtXs?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="are-you-a-data-scientist-struggling-with-some-of-the-ml-engineering-concepts" style="position:relative;">Are you a Data Scientist Struggling with some of the ML engineering concepts?<a href="#are-you-a-data-scientist-struggling-with-some-of-the-ml-engineering-concepts" aria-label="are you a data scientist struggling with some of the ml engineering concepts permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="atinuke-oluwabamikemi-kayode-common-github-terms-for-open-source-contributors" style="position:relative;">Atinuke 
Oluwabamikemi Kayode: Common GitHub Terms for Open Source Contributors<a href="#atinuke-oluwabamikemi-kayode-common-github-terms-for-open-source-contributors" aria-label="atinuke oluwabamikemi kayode common github terms for open source contributors permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>For the learners out there, <a href="https://twitter.com/oluwabamikemi" target="_blank" rel="nofollow noopener noreferrer"><strong>Atinuke Oluwabamikemi Kayode's</strong></a> piece <a href="https://iambami.dev/common-github-terms-for-open-source-contributors-ckvuhdzsf0jcocms1fggb0fj3" target="_blank" rel="nofollow noopener noreferrer">Common GitHub Terms for Open Source Contributors</a> covers the most common terminology you need to know when using GitHub in your projects. Need to understand what "checkout" is? The difference between "origin" and "master"? 
Atinuke has you covered in this piece.</p> <p> </p><section class="elp-content-holder"> <a href="https://iambami.dev/common-github-terms-for-open-source-contributors-ckvuhdzsf0jcocms1fggb0fj3" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Common GitHub Terms for Open Source Contributors</h4> <div class="elp-description">Atinuke Oluwabamikemi Kayode helps you navigate the common terminology in GitHub.</div> <div class="elp-link">https://iambami.dev</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-11-17/kayode-e1b389ce00e5cf7d58417b6598467526.jpeg" alt="Common GitHub Terms for Open Source Contributors"> </div> </a> </section> <p></p> <h3 id="vincent-driessen-a-successful-git-branching-architecture" style="position:relative;">Vincent Driessen: A Successful Git Branching Model<a href="#vincent-driessen-a-successful-git-branching-architecture" aria-label="vincent driessen a successful git branching architecture permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>For a deeper dive into how Git and versioning work, check out the <a href="https://nvie.com/posts/a-successful-git-branching-model/" target="_blank" rel="nofollow noopener noreferrer">A Successful Git Branching Model</a> piece by <a href="https://twitter.com/nvie" target="_blank" rel="nofollow noopener noreferrer"><strong>Vincent Driessen</strong></a>, which explains the Git branching model in detail. 
While this explanation is written with software development in mind, it will help you understand how Git versioning works. This foundation will give you insight into how DVC works, delivering the same capabilities for data, models, and experimentation.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 575px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5cda182ace005d6b228eae1f2de4cf92/39600/git-model.png" alt="Git Versioning in Software Development" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Vincent Driessen's Git Branching Model (<a href="https://nvie.com/posts/a-successful-git-branching-model/" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h3 id="nir-barazida-notebook-to-production" style="position:relative;">Nir Barazida: Notebook to Production<a href="#nir-barazida-notebook-to-production" aria-label="nir barazida notebook to production permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/barazida" target="_blank" rel="nofollow noopener noreferrer"><strong>Nir Barazida</strong></a> of <a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> brings us a blog post on <a href="https://dagshub.com/blog/notebook-to-production-ready-machine-learning/" target="_blank" rel="nofollow noopener noreferrer">Notebook to Production</a> which explains why you should, and how you can, move your code from notebooks 
to scripts when working on production-ready ML projects. You'll see how DVC is used to version everything in the process, so your team will always know which versions of the elements that go into your project produced, or failed to produce, the best results.</p> <p> </p><section class="elp-content-holder"> <a href="https://dagshub.com/blog/notebook-to-production-ready-machine-learning/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Notebook to Production</h4> <div class="elp-description">Nir Barazida shows you why and how to bring your notebook to production-ready code.</div> <div class="elp-link">https://dagshub.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-11-17/dagshub-dvc-9c3e49c8b98691eb79706d384438303d.png" alt="Notebook to Production"> </div> </a> </section> <p></p> <h2 id="dvc-online-course-update" style="position:relative;">DVC Online Course Update!<a href="#dvc-online-course-update" aria-label="dvc online course update permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We know you've wanted it, and the day is getting closer and closer! By the end of this week we will be about 90% done recording videos for the first course, and then it's on to video processing and platform setup. The first course will focus on DVC for Data Scientists and Analysts. You can expect to see the course out by the end of the year. The course will be 100% <strong>FREE</strong> and available from our website. 
We are so excited about how it's coming to life! 🚀</p> <p><img src="https://media.giphy.com/media/hL9q5k9dk9l0wGd4e0/giphy.gif" alt="Loading Downloading GIF by Vera Verreschi"></p> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="san-francisco-off-site" style="position:relative;">San Francisco Off-site<a href="#san-francisco-off-site" aria-label="san francisco off site permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The group of us from the Americas met in San Francisco last week. We had a great time getting to know each other better and working on ways and processes to make our tools even better for you! Amidst our working, we also took time out to visit Alcatraz, go on a scavenger hunt, and eat some great food! Pictured below from left front, going clockwise: Jorge Orpinel, Stephanie Roy, Ivan Shcheklein, Dmitry Petrov, Dave Berenbaum, Jervis Hui, Ken Thom, Jon Burdo, Peter Rowlands, Julie Galvan, Jeny De Figueiredo, Jordan Weber, and Maria Khalusova! 
🎉</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/77dbbc25c91aa6f4b5b96ed0dd55c408/03346/team.jpg" alt="America Team Members meet in San Francisco" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative Team Members meet in San Francisco (<a href="https://www.linkedin.com/in/jorgeorpinel/" target="_blank" rel="nofollow noopener noreferrer">Source: Jorge Orpinel</a>)</em></p> <h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/maria-khalusova-a958aa14/" target="_blank" rel="nofollow noopener noreferrer"><strong>Maria Khalusova</strong></a> joins us from Montreal, Canada as a Senior Developer Advocate. Previously at JetBrains for 14 years, Maria brings a ton of experience in developer advocacy and product management. She has already dived into working on CML and the upcoming releases. She also organizes PyData Montreal. In her free time Maria likes to spend time with her two kids, walk their mixed bulldog, and garden. 
👩🏻‍🌾 Welcome Maria!</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As always, we're still hiring! <a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions including:</p> <ul> <li>Senior Software Engineer (ML, Labeling, Python)</li> <li>Senior Software Engineer (ML, Labeling, Python)</li> <li>Senior Software Engineer (ML, DevTools, Python)</li> <li>Field Data Scientist / Sales Engineer</li> <li>Developer Advocate (ML)</li> <li>Director / VP of Engineering (ML, DevTools)</li> <li>Director / VP of Product (ML, Data Infra, SaaS)</li> <li>Head of Talent</li> <li>Head of DevRel</li> </ul> <p>Please pass this info on to anyone you know that may fit the bill. We look forward to new team members! 
🎉</p> <p><img src="https://media.giphy.com/media/ZcQXsVrAuKMePTJYG6/giphy.gif" alt="Hyper RPG GIF"></p> <h2 id="docs-updates" style="position:relative;">Docs Updates<a href="#docs-updates" aria-label="docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This month's important doc updates come from CML! The CML team has been on fire 🔥 building new things. You will want to keep your eyes tuned to <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML.dev</a> and our social media channels for big news before the end of the year!</p> <h3 id="-cml-self-hosted-runners" style="position:relative;">📖 CML: Self-hosted Runners<a href="#-cml-self-hosted-runners" aria-label=" cml self hosted runners permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Check out the new <a href="https://cml.dev/doc/self-hosted-runners?tab=GitLab#allocating-cloud-compute-resources-with-cml" target="_blank" rel="nofollow noopener noreferrer">Self-hosted Runners</a> doc. This will help you set up your own runners and allocate cloud computing resources. 
Whether you are a GitHub or GitLab user, you will be able to toggle between the respective code needed right there at your fingertips!</p> <h3 id="-cml-command-reference-send-comment" style="position:relative;">📖 CML: Command Reference: <code>send-comment</code><a href="#-cml-command-reference-send-comment" aria-label=" cml command reference send comment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The new <a href="https://cml.dev/doc/ref/send-comment#command-reference-send-comment" target="_blank" rel="nofollow noopener noreferrer">Command Reference: send-comment</a> doc provides a way for you to post a markdown comment on a commit and flags for associating the comment with another pull/merge request or if a <code>cml pr</code> was used earlier in your workflow.</p> <h3 id="-branding-assets" style="position:relative;">📖 Branding Assets<a href="#-branding-assets" aria-label=" branding assets permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you are interested in writing a blog post about our tools, we now have a very easy way for you to get your hands on our logos as well as a guide to let you know how and where it's appropriate to 
use our logos and images. We love it when the Community writes about our tools!<br> <a href="https://iterative.ai/brand" target="_blank" rel="nofollow noopener noreferrer">Find our branding assets here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1b99c851fb73a6233fc6f05e59984d90/39600/brand.png" alt="Iterative.AI Branding Assets" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative.AI branded assets for your next blog post 😉 (<a href="https://iterative.ai/brand" target="_blank" rel="nofollow noopener noreferrer">Source</a>)</em></p> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Be sure to join us at the <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282064369/" target="_blank" rel="nofollow noopener noreferrer">December Office Hours Meetup</a>, where we will be showing a demo of a new feature! 
We can't say more just yet 🤐, but be sure to RSVP!</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/282064369/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVC Office Hours - New Feature Release</h4> <div class="elp-description">Join us at the December Office Hours for a demo of a new feature in DVC!</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-10-15/office-hours-meetup-409c5ab48d208e9a9cdc6871fd4c0937.png" alt="DVC Office Hours - New Feature Release"> </div> </a> </section> <p></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Last but never least, I leave you with this great tweet from Paige Bailey, this time about CML's docs:</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🦉<a href="https://twitter.com/DVCorg">@DVCorg</a>'s docs are *shiny*—especially the sample code for generating reports, using either <a href="https://twitter.com/github">@GitHub</a> or <a href="https://twitter.com/gitlab">@GitLab</a>.<a href="https://t.co/PKPS923HUR">https://t.co/PKPS923HUR</a><br><br>All you have to do to auto-generate a report with metrics and plots, is include the YAML file in a .github/workflows folder in your repo. 
<a href="https://t.co/WTSZYcLjwI">pic.twitter.com/WTSZYcLjwI</a></p>— 👩‍💻 Paige Bailey (@DynamicWebPaige) <a href="https://twitter.com/DynamicWebPaige/status/1459395186027470849">November 13, 2021</a></blockquote> <hr> <p><em>Have something great to say about our tools? We'd love to hear it! Head to <a href="https://testimonial.to/iterative-open-source-community-shout-outs" target="_blank" rel="nofollow noopener noreferrer">this page</a> to record or write a Testimonial! Join our <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer">Wall of Love ❤️</a></em></p> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/october-21-community-gemshttps://dvc.org/blog/october-21-community-gemsThu, 28 Oct 2021 00:00:00 GMT<h3 id="is-there-a-command-to-force-reproduce-a-specific-stage-of-a-dvc-pipeline" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/893056918699008000" target="_blank" rel="nofollow noopener noreferrer">Is there a command to force reproduce a specific stage of a DVC pipeline?</a><a href="#is-there-a-command-to-force-reproduce-a-specific-stage-of-a-dvc-pipeline" aria-label="is there a command to force reproduce a specific stage of a dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 
1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Good question @wickeat!</p> <p>You can use <a href="https://dvc.org/doc/command-reference/repro#-f"><code>dvc repro -f &lt;stage_name&gt;</code></a>, although this will also reproduce the earlier dependency stages in the pipeline up to that point. If you only want to reproduce a single target stage, you can add <code>-s/--single-item</code> to the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command.</p> <h3 id="how-do-you-manage-a-dvcyaml-file-for-a-project-thats-going-to-be-a-big-sparse-dag" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/893487527749623859" target="_blank" rel="nofollow noopener noreferrer">How do you manage a <code>dvc.yaml</code> file for a project that's going to be a big, sparse DAG?</a><a href="#how-do-you-manage-a-dvcyaml-file-for-a-project-thats-going-to-be-a-big-sparse-dag" aria-label="how do you manage a dvcyaml file for a project thats going to be a big sparse dag permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is an awesome use case from @Ian!</p> <p>Let's say we have this scenario:</p> <ul> <li>A new data set is delivered to you every day</li> <li>It needs to be featurized (featurization does not depend on previous days' data)</li> <li>A subsequent stage depends on all days</li> </ul> <p>The recommended approach is to keep all of the previous days and use the <code>foreach</code> syntax, which ensures your DAG still knows about all the previously processed 
days:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">featurize</span><span class="token punctuation">:</span> <span class="token key atrule">foreach</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token number">20210101</span> <span class="token punctuation">-</span> <span class="token number">20210102</span> <span class="token punctuation">-</span> <span class="token number">20210103</span> <span class="token key atrule">do</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python featurize.py $<span class="token punctuation">{</span>item<span class="token punctuation">}</span> <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> raw/$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>.csv <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> intermediate/$<span class="token punctuation">{</span>item<span class="token punctuation">}</span>.csv <span class="token key atrule">combine</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python combine.py <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> intermediate <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> combined.csv</code></pre></div> <p>That way if you adjusted something in your featurize script, for example, it would automatically reprocess every day's data.</p> <h3 
id="what-is-the-best-practice-for-capturing-and-saving-stdout" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/893903023355613214" target="_blank" rel="nofollow noopener noreferrer">What is the best practice for capturing and saving <code>stdout</code>?</a><a href="#what-is-the-best-practice-for-capturing-and-saving-stdout" aria-label="what is the best practice for capturing and saving stdout permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The best practice when using DVC is to pipe each command's <code>stdout</code> into a different file with a unique name, such as one containing a timestamp, in a directory that becomes the stage output.</p> <p>This is also what we recommend if storage space is a concern, since the <code>stdout</code> dumps can grow a lot.</p> <p>Here's an example of what that might look like if you're using a tool like <code>tee</code>.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python src/train.py data/features model.pkl <span class="token punctuation">|</span> tee <span class="token punctuation">-</span>a 20211021_model.pkl <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data/features <span class="token punctuation">-</span> src/train.py <span class="token key atrule">params</span><span 
class="token punctuation">:</span> <span class="token punctuation">-</span> train.min_split <span class="token punctuation">-</span> train.n_est <span class="token punctuation">-</span> train.seed <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> models/20211026_model.pkl</code></pre></div> <p>This will output the <code>stdout</code> from the train stage in the terminal and also save it in a new file with the timestamp as part of the title.</p> <p>That was a helpful question. Thanks @gregk0!</p> <h3 id="there-is-a-file-in-our-pipeline-that-needs-to-be-manually-modified-and-then-used-as-the-input-to-other-stages-in-the-pipeline-what-would-be-the-best-approach-for-this-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/894577842363445308" target="_blank" rel="nofollow noopener noreferrer">There is a file in our pipeline that needs to be manually modified and then used as the input to other stages in the pipeline. What would be the best approach for this with DVC?</a><a href="#there-is-a-file-in-our-pipeline-that-needs-to-be-manually-modified-and-then-used-as-the-input-to-other-stages-in-the-pipeline-what-would-be-the-best-approach-for-this-with-dvc" aria-label="there is a file in our pipeline that needs to be manually modified and then used as the input to other stages in the pipeline what would be the best approach for this with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is another great use case. 
Thanks @omarelb!</p> <p>Let's say that you have a process similar to this.</p> <ul> <li>Run the first stage of the pipeline, for example a stage called <code>cleaning</code></li> <li>Inspect its output, <code>lexicon.txt</code>, and modify it if necessary</li> <li>The modified version of <code>lexicon.txt</code> is then cached and used as input to following stages of the pipeline</li> </ul> <p>You can copy the output and modify and commit it in the copied location so the first stage and its output are separate from the modified file and subsequent stages.</p> <p>If you want to link the first stage to the rest of the pipeline, you could have your 2nd stage be something like:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">manual</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> # To generate lexicon_modified.txt: # 1. Run `cp lexicon.txt lexicon_modified.txt`. # 2. Check and modify lexicon_modified.txt. # 3. Run `dvc commit manual`.</span> <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> lexicon.txt <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> lexicon_modified.txt</code></pre></div> <p>To clarify, if you put that <code>manual</code> stage into your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, it should connect the whole pipeline. 
Each time you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> and the first stage generates a new <code>lexicon.txt</code>, you will get <code>ERROR: failed to reproduce 'dvc.yaml': output 'lexicon_modified.txt' does not exist</code> because the manual stage doesn't generate the expected output.</p> <p>You can then manually copy, modify, and commit your new <code>lexicon_modified.txt</code> and run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> again to run the rest of the pipeline.</p> <h3 id="what-is-the-workflow-if-i-want-to-remove-some-files-from-my-dataset-registry-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/895192983366942740" target="_blank" rel="nofollow noopener noreferrer">What is the workflow if I want to remove some files from my dataset registry with DVC?</a><a href="#what-is-the-workflow-if-i-want-to-remove-some-files-from-my-dataset-registry-with-dvc" aria-label="what is the workflow if i want to remove some files from my dataset registry with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In this case, assume that the data was added as a folder containing images, which means that there is a single <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> for the whole folder. 
You don't need to remove the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file that's tracking the data in that folder.</p> <p>You can delete the files you want to remove and then re-add the folder using <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a>. Here's an example of what that flow might look like:</p> <ul> <li>You <code>git clone</code> your data registry.</li> <li>Then <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> your data.</li> <li>Delete the files you want to remove.</li> <li>Run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> and <code>git commit</code> to save your changes.</li> </ul> <p>It should be faster to commit, as DVC won't re-add the files to the cache, nor will it try to hash them.</p> <p>Good question @MadsO!</p> <h3 id="we-want-to-access-a-private-git-repo-using-dvcapiread-in-a-docker-container-how-do-i-pass-the-credentials-to-dvc-so-that-we-can-read-dvc-files-from-this-repo" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/894533078389784577" target="_blank" rel="nofollow noopener noreferrer">We want to access a private Git repo using <code>dvc.api.read()</code> in a Docker container. 
How do I pass the credentials to DVC so that we can read DVC files from this repo?</a><a href="#we-want-to-access-a-private-git-repo-using-dvcapiread-in-a-docker-container-how-do-i-pass-the-credentials-to-dvc-so-that-we-can-read-dvc-files-from-this-repo" aria-label="we want to access a private git repo using dvcapiread in a docker container how do i pass the credentials to dvc so that we can read dvc files from this repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Great question about the API, @dashmote!</p> <p>There are a couple of different ways to handle this.</p> <p>The first option is to use SSH. You'll need to pass GitHub SSH keys into your Docker container and use the <code>git@github.com:username/repo.git</code> URL format when you call the API method.</p> <p>The other option is to use HTTP. 
You need to use the <code>https://username:token@github.com/username/repo.git</code> URL format when you call the API method.</p> <p>You could pass your credentials into your container as environment variables and then do something like:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">username <span class="token operator">=</span> os<span class="token punctuation">.</span>environ<span class="token punctuation">[</span><span class="token string">"GITHUB_USERNAME"</span><span class="token punctuation">]</span> token <span class="token operator">=</span> os<span class="token punctuation">.</span>environ<span class="token punctuation">[</span><span class="token string">"GITHUB_TOKEN"</span><span class="token punctuation">]</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">,</span> repo<span class="token operator">=</span><span class="token string-interpolation"><span class="token string">f"https://</span><span class="token interpolation"><span class="token punctuation">{</span>username<span class="token punctuation">}</span></span><span class="token string">:</span><span class="token interpolation"><span class="token punctuation">{</span>token<span class="token punctuation">}</span></span><span class="token string">/..."</span></span><span class="token punctuation">,</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">)</span></code></pre></div> <h3 id="is-there-a-clean-way-to-handle-multiple-models-in-the-same-repo-that-are-trained-using-the-same-pipeline" style="position:relative;"><a 
href="https://discord.com/channels/485586884165107732/563406153334128681/895368479853649930" target="_blank" rel="nofollow noopener noreferrer">Is there a clean way to handle multiple models in the same repo that are trained using the same pipeline?</a><a href="#is-there-a-clean-way-to-handle-multiple-models-in-the-same-repo-that-are-trained-using-the-same-pipeline" aria-label="is there a clean way to handle multiple models in the same repo that are trained using the same pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Let's say your project looks something like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">├── data │ ├── customer_1 │ │ ├── input_data.txt │ │ ├── input_params.yaml │ │ └── output │ │ └── model.pkl │ └── customer_2 │ ├── input_data.txt │ ├── input_params.yaml │ └── output │ └── model.pkl ├── dvc.lock ├── dvc.yaml └── train_model.py</code></pre></div> <p>The simplest way is to copy the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> into each model's separate directory, like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">├── data │ ├── customer_1 │ │ ├── input_data.txt │ │ ├── input_params.yaml │ │ ├── dvc.yaml │ │ ├── dvc.lock │ │ └── output │ │ └── model.pkl │ └── customer_2 │ ├── input_data.txt │ ├── input_params.yaml │ ├── dvc.yaml │ ├── dvc.lock │ └── output │ └── model.pkl └── train_model.py</code></pre></div> <p>Another potential solution is 
to try templating. We'll have a <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> in the root of the project and add <code>vars</code> to define the model you want to train. Then we'll update the <code>train</code> stage to use the <code>vars</code> like this:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">vars</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">model_name</span><span class="token punctuation">:</span> <span class="token string">'customer_2'</span> <span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train_model.py <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data/$<span class="token punctuation">{</span>model_name<span class="token punctuation">}</span>/input_data.txt <span class="token key atrule">params</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data/$<span class="token punctuation">{</span>model_name<span class="token punctuation">}</span>/input_params.yaml<span class="token punctuation">:</span> <span class="token punctuation">-</span> batch_size <span class="token punctuation">-</span> <span class="token punctuation">...</span></code></pre></div> <p>You can <a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating" target="_blank" rel="nofollow noopener noreferrer">learn more about templating in the docs</a>. 
It essentially lets you add variables to the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> to dynamically set values for your stages.</p> <p>Thanks for the great question @omarelb!</p> <hr> <p><img src="https://media.giphy.com/media/26u4lOMA8JKSnL9Uk/giphy.gif" alt="My Work Is Done Reaction GIF by SpongeBob SquarePants"></p> <p>At our November Office Hours Meetup, we will be going over internal Kaggle competitions and PyTorch Lightning integration. <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/281355245/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/october-21-heartbeathttps://dvc.org/blog/october-21-heartbeatFri, 15 Oct 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>This month we have been flooded with content from our Community. 
We are grateful and inspired to keep serving you!</p> <p><img src="https://media.giphy.com/media/xUA7aN1MTCZx97V1Ic/giphy.gif" alt="Thank you!"></p> <h2 id="ricardo-manhães-savii-trying-to-turn-machine-learning-into-value" style="position:relative;">Ricardo Manhães Savii: Trying to turn Machine Learning into value<a href="#ricardo-manh%C3%A3es-savii-trying-to-turn-machine-learning-into-value" aria-label="ricardo manhães savii trying to turn machine learning into value permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If we can't turn machine learning into value, what good are we? <a href="https://www.linkedin.com/in/ricardoms/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ricardo Manhães Savii</strong></a> <a href="https://medium.com/@ricardosavii/trying-to-turn-machine-learning-into-value-de9f28cde056" target="_blank" rel="nofollow noopener noreferrer">wrote a piece in Medium</a> where he tackles how to technically and visually define the steps to deliver an Intelligent System with the same level of best practice maturity that software development has today. He combines and synthesizes the ideas of some of the best known thinkers in the space to build a thorough architecture of machine learning best practices. 
You won't want to miss this post, so take some time to wrap your head around these diagrams!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 596.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bd6a61c1bdb9d432121f1a603588a9fa/39600/manhaes.png" alt="CI/CD for Machine Learning" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Ricardo Manhães Savii's Addendum to <a href="https://medium.com/@francois.chollet" target="_blank" rel="nofollow noopener noreferrer">François Chollet's</a> figure on the result of machine learning (<a href="https://medium.com/@ricardosavii/trying-to-turn-machine-learning-into-value-de9f28cde056" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="rappibank-how-to-build-an-efficient-machine-learning-project-workflow" style="position:relative;">RappiBank: How to build an efficient machine learning project workflow<a href="#rappibank-how-to-build-an-efficient-machine-learning-project-workflow" aria-label="rappibank how to build an efficient machine learning project workflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Continuing the theme of ML workflow complexity, <a href="https://www.linkedin.com/in/data-box-science/" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Baena</strong></a> wrote a <a 
href="https://medium.com/rappibank/how-to-build-an-efficient-machine-learning-project-workflow-using-data-version-control-dvc-aaeaa9cfb79b" target="_blank" rel="nofollow noopener noreferrer">great overview and tutorial piece</a> outlining the challenges that his team at <a href="https://bank.rappi.com.br/" target="_blank" rel="nofollow noopener noreferrer">RappiBank</a> encountered and solved with DVC, including:</p> <ul> <li>confusing experiment files with different names</li> <li>disjointed messaging about training, model, and dataset changes</li> <li>progress kept in your head or in personal notes, where it is not visible to the rest of the team</li> <li>heavy run and re-run times without a modularized system</li> </ul> <p>Daniel shows how all of these things can be solved using DVC. 🏆</p> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/rappibank/how-to-build-an-efficient-machine-learning-project-workflow-using-data-version-control-dvc-aaeaa9cfb79b" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">How to Build an Efficient Machine Learning Project Workflow Using Data Version Control (DVC)</h4> <div class="elp-description">Daniel Baena's overview of common MLOps challenges encountered at RappiBank and how they are solved with DVC.</div> <div class="elp-link">https://medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-10-15/baena-f06654520af4066465ffd7982e0b0fea.jpeg" alt="How to Build an Efficient Machine Learning Project Workflow Using Data Version Control (DVC)"> </div> </a> </section> <p></p> <h2 id="dagshub-production-oriented-work" style="position:relative;">DAGsHub: Production Oriented Work<a href="#dagshub-production-oriented-work" aria-label="dagshub production oriented work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 
0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Next up, <a href="https://twitter.com/barazida" target="_blank" rel="nofollow noopener noreferrer"><strong>Nir Barazida</strong></a> from <a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> <a href="https://dagshub.com/docs/workshops/production_oriented_work/?utm_content=bufferef4d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer" target="_blank" rel="nofollow noopener noreferrer">created a video</a> on production-oriented work using a monorepo strategy, focusing on moving from research to production-ready code using Git and DVC. If you are a data scientist trying to wrap your head around going from your notebook to production, this may help!</p> <p> </p><section class="elp-content-holder"> <a href="https://dagshub.com/docs/workshops/production_oriented_work/?utm_content=bufferef4d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Production-Oriented Work with Git, DVC and DAGsHub</h4> <div class="elp-description">Nir Barazida's tutorial and video on how to use a monorepo strategy and go from your notebook to production-ready code.</div> <div class="elp-link">https://dagshub.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-10-15/dagshub-aa036fbcd9874d7c399ca6ef36cfc846.jpg" alt="Production-Oriented Work with Git, DVC and DAGsHub"> </div> </a> </section> <p></p> <h2 id="ml-data-versioning-with-dvc-how-to-manage-machine-learning-data" style="position:relative;">ML Data Versioning with DVC: How to 
Manage Machine Learning Data<a href="#ml-data-versioning-with-dvc-how-to-manage-machine-learning-data" aria-label="ml data versioning with dvc how to manage machine learning data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/piotr-storo%C5%BCenko-438087128/" target="_blank" rel="nofollow noopener noreferrer"><strong>Piotr Storożenko</strong></a> of <a href="https://appsilon.com/" target="_blank" rel="nofollow noopener noreferrer">Appsilon</a> wrote <a href="https://appsilon.com/ml-data-versioning-with-dvc/" target="_blank" rel="nofollow noopener noreferrer">a great tutorial</a> covering the many challenges data scientists and ML engineers encounter in their data versioning efforts and how DVC solves them. Do these scenarios from his article look familiar?</p> <blockquote> <p>Was it in <code>model_3final.pth</code> or <code>model_last.pth</code> that I used a bigger learning rate?</p> <p>When did I start using data preprocessing, during <code>model_2a.pth</code> or <code>model_2aa.pth</code>?</p> <p>Is <code>model_7.pth</code> trained on the new dataset or on the old one?</p> <p>Oh, gosh, which set of parameters and data have I used to train <code>model_2.pth</code>? 
It was pretty good in the end…</p> </blockquote> <h1 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="raviraja-gantas-10-week-course-on-basic-mlops" style="position:relative;">Raviraja Ganta's 10-week course on Basic MLOps<a href="#raviraja-gantas-10-week-course-on-basic-mlops" aria-label="raviraja gantas 10 week course on basic mlops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Twitter and LinkedIn were ablaze last month when <a href="https://www.linkedin.com/in/ravirajag/" target="_blank" rel="nofollow noopener noreferrer"><strong>Raviraja Ganta</strong></a> announced his <a href="https://www.ravirajag.dev/blog/mlops-summary" target="_blank" rel="nofollow noopener noreferrer">10-Week Course</a> on MLOps basics. This course is chock-full of resources and practical tutorials to build your MLOps platform and knowledge. 
<a href="https://www.ravirajag.dev/blog/mlops-dvc" target="_blank" rel="nofollow noopener noreferrer">Week 3</a> of the course is about DVC and its ability to solve your versioning and reproducibility challenges. Be sure to check out <a href="https://github.com/graviraja/MLOps-Basics" target="_blank" rel="nofollow noopener noreferrer">the course repo</a> as well!</p> <p><a href="https://mlops.community/" target="_blank" rel="nofollow noopener noreferrer"><strong>MLOps Community</strong></a> is hosting him to speak about his course on October 20th. <a href="https://airtable.com/shrh5eGdEbcBsdEdq" target="_blank" rel="nofollow noopener noreferrer">Sign up to attend here!</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3cf1063b4d2bf22102e5a1e310032794/39600/ganta.png" alt="Raviraja Ganta's 10-Week MLOps Course" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Raviraja Ganta's 10-Week Course on MLOps Basics (<a href="https://www.ravirajag.dev/blog/mlops-summary" target="_blank" rel="nofollow noopener noreferrer">Source link</a>)</em></p> <h2 id="josh-willis-video-on-covid-simulations-with-dvc" style="position:relative;">Josh Wills video on COVID simulations with DVC<a href="#josh-willis-video-on-covid-simulations-with-dvc" aria-label="josh willis video on covid simulations with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This week, <a 
href="https://twitter.com/josh_wills/status/1441456258746249216" target="_blank" rel="nofollow noopener noreferrer">this Tweet comment</a> led me to <a href="https://mlconf.com/sessions/the-covid-scenario-pipeline-high-stakes-data-science/" target="_blank" rel="nofollow noopener noreferrer">this work</a> by <a href="https://twitter.com/josh_wills" target="_blank" rel="nofollow noopener noreferrer"><strong>Josh Wills.</strong></a> Josh was tapped by <a href="https://twitter.com/dpatil" target="_blank" rel="nofollow noopener noreferrer"><strong>DJ Patil</strong></a> to participate in some COVID simulation research early on in the pandemic in which he used DVC. In his presentation about the project, he tells of the tools he used and challenges of the use case. Nice DVC shout out at 19:56! Ah, the fruits of a Twitter 🐇🕳!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/tu7N8M-jwPU?rel=0&%3B=&%3Bshowinfo=0%3B&start=10" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="september-office-hours-video-transfer-learning-with-milecia-mcgregor" style="position:relative;">September Office Hours Video: Transfer Learning with Milecia McGregor<a href="#september-office-hours-video-transfer-learning-with-milecia-mcgregor" aria-label="september office hours video transfer learning with milecia mcgregor permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 
3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you missed last month's Office Hours <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a>, you can now catch the video! <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia's</strong></a> presentation was based on <a href="https://dvc.org/blog/transfer-learning-experiments" target="_blank" rel="nofollow noopener noreferrer">her blog post</a> on the same topic: Using Experiments for Transfer Learning. If you're curious about transfer learning in general, AlexNet and SqueezeNet in particular, or using DVC experiments and checkpoints to track all that you do, this video's for you!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/RmJbyQ36zVk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="quoc-tien-au-continuously-learning-on-the-job-as-a-data-scientist" style="position:relative;">Quoc-Tien Au: Continuously Learning on the Job as a Data Scientist<a href="#quoc-tien-au-continuously-learning-on-the-job-as-a-data-scientist" aria-label="quoc tien au continuously learning on the job as a data scientist 
permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://towardsdatascience.com/the-what-where-and-how-about-continuously-learning-on-the-job-as-a-data-scientist-b0a31ea4ac48" target="_blank" rel="nofollow noopener noreferrer">This Towards Data Science</a> article by <a href="https://www.linkedin.com/in/quoctienau/" target="_blank" rel="nofollow noopener noreferrer"><strong>Quoc-Tien Au</strong></a> entitled "The What, Where, and How about continuously learning on the job as a data scientist," speaks to the need for a continuous-learning mindset in the Data Science field. It's packed with great thought processes and resources on what to learn, where to learn, and how to keep learning while still getting your work done. Who struggles with this? 
😅</p> <p><img src="https://media.giphy.com/media/icJCVO3GPDbCvvfgpf/giphy.gif" alt="Thats Me I Am GIF by Ryn Dean"></p> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="amsterdam-off-site" style="position:relative;">Amsterdam Off-site<a href="#amsterdam-off-site" aria-label="amsterdam off site permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Most of our team members from Europe got together in Amsterdam recently for a couple days of brainstorming and team bonding. They went on a Treasure Hunt, ate Ramen (a favorite among our team) and had great discussions on how to make our tools and our team even better! 
Pictured below, starting at the front left of the room and going clockwise (to the back of the room and back up), are David Ortega, Helio Machado, David de la Iglesia Castro, Laurens Duijvesteijn, Ruslan Kupriev (hidden), Dmitry Petrov, Jelle Bouwman, Batuhan Taskaya, Svetlana Sachkovskaya, and Paweł Redzyński.</p> <p>Be sure to check out this section next month as our Americas team members will meet in San Francisco!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/24ec369b65ff5da6f58b0ccfe4ec622d/03346/amsterdam.jpg" alt="Europe Iterative Team Members meet in Amsterdam" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Iterative Team Members meet in Amsterdam (<a href="https://www.linkedin.com/in/gortegadavid/" target="_blank" rel="nofollow noopener noreferrer">Source: David Ortega</a>)</em></p> <h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/jordanwweber/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jordan Weber</strong></a> joins us from Los Angeles, California as our new Chief of Staff. She has previously held similar roles at venture capital and FinTech firms. In Jordan's free time she enjoys cooking, tennis, dance, and hiking! 
🎾</p> <p><a href="https://www.linkedin.com/in/kenthom/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ken Thom</strong></a> joins us from Palo Alto, California as our new Director of Operations. His past work includes business operations, product management, and software and hardware development. In his spare time he likes to spend time with his family, swim, ski, and hike! 🥾</p> <p><a href="https://www.linkedin.com/in/jon-burdo-59730a83/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jon Burdo</strong></a> joins us from Boston, Massachusetts as a Senior Software Engineer. He's been working for the past few years as a machine learning engineer with a focus on NLP. In his last role he used DVC and loved it, which is how he eventually ended up here! 🎉 In his spare time, Jon likes learning about open source software, tinkering with Linux, and inline skating.</p> <p><a href="https://www.linkedin.com/in/stephroy1/" target="_blank" rel="nofollow noopener noreferrer"><strong>Stephanie Roy</strong></a> joins the team as a Senior Software Engineer from Quebec, Canada. Our first Canadian team member! She has previously worked at LogMeIn on one of their mobile apps. In her spare time she likes taking care of her plants in her indoor grow house, playing roller derby, and discovering new things to watch, listen to and eat! 😋</p> <p>Welcome to all our new team members! We are so glad you are here! 
🙌🏼</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>And wouldn't you know it? We're still hiring! <a href="https://iterative.notion.site/Iterative-ai-is-Hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions including:</p> <ul> <li>Senior Software Engineer (ML, Labeling, Python)</li> <li>Senior Software Engineer (ML, DevTools, Python)</li> <li>Field Data Scientist / Sales Engineer</li> <li>Developer Advocate (ML)</li> <li>Director / VP of Engineering (ML, DevTools)</li> <li>Director / VP of Product (ML, Data Infra, SaaS)</li> <li>Head of Talent</li> <li>Head of DevRel</li> </ul> <p>Please pass this info on to anyone you know who may fit the bill. We look forward to new team members! 
🎉</p> <p><img src="https://media.giphy.com/media/120jXUxrHF5QJ2/giphy.gif" alt="High Five Amy Poehler GIF"></p> <h2 id="docs-updates" style="position:relative;">Docs Updates<a href="#docs-updates" aria-label="docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here are a few important docs updates you may want to take a look at this month!</p> <h3 id="-pytorch-lightning" style="position:relative;">📖 PyTorch Lightning<a href="#-pytorch-lightning" aria-label=" pytorch lightning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We all have <a href="https://www.linkedin.com/search/results/all/?keywords=ilia%20sirotkin&origin=RICH_QUERY_SUGGESTION&position=0&searchId=e7bb3154-797a-44a5-a209-90ffece95246&sid=GeC" target="_blank" rel="nofollow noopener noreferrer"><strong>Ilia Sirotkin</strong></a> to thank for his contribution to our docs. 
He created the <a href="https://dvc.org/doc/dvclive/api-reference/ml-frameworks/pytorch-lightning" target="_blank" rel="nofollow noopener noreferrer">PyTorch Lightning integration docs</a> for all to use!</p> <h3 id="-cml-with-dvc-guide" style="position:relative;">📖 CML with DVC guide:<a href="#-cml-with-dvc-guide" aria-label=" cml with dvc guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://cml.dev/doc/cml-with-dvc" target="_blank" rel="nofollow noopener noreferrer">Our updated CML with DVC Guide</a> provides updated code and streamlined information on Cloud Storage Provider credentials and GitHub Actions setup.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> CML & DVC
<span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span>
<span class="token key atrule">jobs</span><span class="token punctuation">:</span>
  <span class="token key atrule">run</span><span class="token punctuation">:</span>
    <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> ubuntu<span class="token punctuation">-</span>latest
    <span class="token key atrule">container</span><span class="token punctuation">:</span> docker<span class="token punctuation">:</span>//ghcr.io/iterative/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1
    <span class="token key atrule">steps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2
        <span class="token key atrule">with</span><span class="token punctuation">:</span>
          <span class="token key atrule">fetch-depth</span><span class="token punctuation">:</span> <span class="token number">0</span>
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Train model
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span>
          <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          pip install -r requirements.txt  # Install dependencies
          dvc pull data --run-cache        # Pull data & run-cache from S3
          dvc repro                        # Reproduce pipeline</span>
      <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> Create CML report
        <span class="token key atrule">env</span><span class="token punctuation">:</span>
          <span class="token key atrule">REPO_TOKEN</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GITHUB_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span>
        <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string">
          echo "## Metrics" >> report.md
          dvc metrics diff master --show-md >> report.md</span>

          <span class="token comment"># Publish confusion matrix diff</span>
          echo "<span class="token comment">## Plots" >> report.md</span>
          echo "<span class="token comment">### Class confusions" >> report.md</span>
          dvc plots diff \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>target classes.csv \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>template confusion \
            <span class="token punctuation">-</span>x actual \
            <span class="token punctuation">-</span>y predicted \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>vega master <span class="token punctuation">></span> vega.json
          vl2png vega.json <span class="token punctuation">-</span>s 1.5 <span class="token punctuation">></span> plot.png
          cml publish <span class="token punctuation">-</span><span class="token punctuation">-</span>md plot.png <span class="token punctuation">></span><span class="token punctuation">></span> report.md

          <span class="token comment"># Publish regularization function diff</span>
          echo "<span class="token comment">### Effects of regularization" >> report.md</span>
          dvc plots diff \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>target estimators.csv \
            <span class="token punctuation">-</span>x Regularization \
            <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>vega master <span class="token punctuation">></span> vega.json
          vl2png vega.json <span class="token punctuation">-</span>s 1.5 <span class="token punctuation">></span> plot.png
          cml publish <span class="token punctuation">-</span><span class="token punctuation">-</span>md plot.png <span class="token punctuation">></span><span class="token punctuation">></span> report.md

          cml send<span class="token punctuation">-</span>comment report.md</code></pre></div> <h3 id="-shtab" style="position:relative;">📖 Shtab<a href="#-shtab" aria-label=" shtab permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Team member <a href="https://cdcl.ml" target="_blank" rel="nofollow noopener noreferrer"><strong>Casper da Costa-Luis</strong></a> has <a href="https://docs.iterative.ai/shtab/" target="_blank" rel="nofollow noopener noreferrer">created a docs website</a> for his Python tab-completion script generator project <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer">shtab</a>. 
For more info, check out <a href="https://dvc.org/blog/shtab-completion-release" target="_blank" rel="nofollow noopener noreferrer">the original blog post</a> about it as well.</p> <h2 id="next-meetups" style="position:relative;">Next Meetups<a href="#next-meetups" aria-label="next meetups permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>For the second class of <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280814336/" target="_blank" rel="nofollow noopener noreferrer">DVC Learn,</a> join us to learn about getting started with running experiments! This lesson will include information on how to use our <a href="https://dvc.org/doc/user-guide/experiment-management/checkpoints" target="_blank" rel="nofollow noopener noreferrer">checkpoints</a> feature as well. 
We look forward to seeing you there!</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280814336/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVC Learn - Getting Started with Running Experiments</h4> <div class="elp-description">Milecia McGregor shows us how to get started with DVC Experiments and Checkpoints</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-10-15/dvc_learn-2c4f8bdab833cb821b246bc5a7d0e118.png" alt="DVC Learn - Getting Started with Running Experiments"> </div> </a> </section> <p></p> <p>Be sure to join us at the <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/281355245/" target="_blank" rel="nofollow noopener noreferrer">November Office Hours Meetup,</a> where <a href="https://www.linkedin.com/in/maykon-schots/" target="_blank" rel="nofollow noopener noreferrer"><strong>Maykon Shots</strong></a> will talk about how he used DVC and CML to create an internal Kaggle competition for his team to arrive at their best models in their work for the largest bank in Brazil.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/281355245//" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVC Office Hours - Creating an Internal Kaggle Competition with DVC and CML</h4> <div class="elp-description">Maykon Shots shows us how he used DVC and CML to create an internal Kaggle competition for his team</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-10-15/office-hours-meetup-409c5ab48d208e9a9cdc6871fd4c0937.png" alt="DVC Office Hours - Creating an Internal Kaggle Competition with DVC and CML"> </div> </a> 
</section> <p></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This month, it was exceedingly hard to pick just one Tweet. I'm leaving you with one that ballooned our followers over the last month. But there have been many! I encourage you to visit our newly created <a href="https://testimonial.to/iterative-open-source-community-shout-outs/all" target="_blank" rel="nofollow noopener noreferrer"><em>Wall of Love ❤️</em></a> to see all the beautiful Iterative tool love. 🛠❤️🤗</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Startups I'm *incredibly* bullish about: <a href="https://twitter.com/stripe">@Stripe</a>, <a href="https://twitter.com/Iterativeai">@IterativeAI</a>, <a href="https://twitter.com/huggingface">@HuggingFace</a>, and <a href="https://twitter.com/explosion_ai">@Explosion_AI</a>.<br><br>If you're an engineer/PM considering a career change (and it's that time of the year again, no? 😆)—but want to opt away from FAAMG, definitely consider one of the companies above.</p>— 👩‍💻 Paige Bailey (@DynamicWebPaige) <a href="https://twitter.com/DynamicWebPaige/status/1435256826375720964">September 7, 2021</a></blockquote> <hr> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/adding-data-to-build-a-more-generic-modelhttps://dvc.org/blog/adding-data-to-build-a-more-generic-modelTue, 05 Oct 2021 00:00:00 GMT<h2 id="intro" style="position:relative;">Intro<a href="#intro" aria-label="intro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>You might be in the middle of training a model and then the business problem shifts. Now you have this model that has been going through the training process with a specific dataset and you need to make the model more generic.</p> <p>There's likely something that your model learned that can be useful on this new dataset, so you might not have to restart the entire training process. We'll do an example of updating a pre-trained model to use a broader dataset with DVC. 
By the end of this, you should see how you can handle this quickly and start running new experiments to get a more generic model.</p> <h2 id="the-original-pre-trained-model" style="position:relative;">The original pre-trained model<a href="#the-original-pre-trained-model" aria-label="the original pre trained model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>For this post, we'll be making a more generic image classifier by taking the original dataset with bees and ants and adding cats and dogs to it. You can clone <a href="https://github.com/iterative/pretrained-model-demo" target="_blank" rel="nofollow noopener noreferrer">this GitHub repo</a> to get the current bees and ants model and check out <a href="https://dvc.org/blog/transfer-learning-experiments" target="_blank" rel="nofollow noopener noreferrer">this post</a> on how we experimented with both AlexNet and SqueezeNet to build this model.</p> <p>So we're starting from our current bees and ants model and extending it to classify dogs and cats as well. 
We'll start by adding some cats and dogs data to our validation data and running some experiments with the current model to see how it performs on generic data.</p> <p>Then we'll add the cats and dogs data to the training data and watch how the model improves as we run experiments.</p> <h2 id="updating-the-dataset-with-dvc" style="position:relative;">Updating the dataset with DVC<a href="#updating-the-dataset-with-dvc" aria-label="updating the dataset with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To add the new cats and dogs dataset to the project, we'll use this DVC command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> https://github.com/iterative/dataset-registry blog/cats-dogs</span></code></pre></div> <p>This downloads a sample dataset with images of cats and dogs. You can use this command to download files or directories that are tracked by DVC or Git. This command can be used from anywhere in the file system, as long as DVC is installed.</p> <p>This creates a new directory called <code>./cats-dogs/data/</code>, downloaded from the DVC remote, that contains images of cats and dogs. 
Now we can gradually add the new data to the existing data.</p> <p>We'll start by moving the <code>val</code> data for <code>cats</code> and <code>dogs</code> from the <code>./cats-dogs/data/</code> directory to the corresponding directory in <code>data/hymenoptera_data</code>.</p> <p><em>Just a quick note: cats and dogs don't really belong in the <code>hymenoptera</code> directory since that's specific to ants and bees, but it's the easiest and fastest way to add the data for this tutorial.</em></p> <p>With this new data in place, we can start training our model.</p> <h2 id="running-new-experiments-with-generic-data" style="position:relative;">Running new experiments with generic data<a href="#running-new-experiments-with-generic-data" aria-label="running new experiments with generic data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>With the updated data, let's run an experiment on the model and see how good the results are. To run a new experiment, open your terminal and make sure you have a virtual environment activated. 
Then run this command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div> <p>Once the training epochs are finished, run the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span> <span class="token parameter variable">--include-metrics</span> step,acc,val_acc,loss,val_loss <span class="token punctuation">\</span> <span class="token parameter variable">--include-params</span> lr,momentum</span></code></pre></div> <p>The <code>--no-timestamp</code> option hides the timestamps from the table. The <code>--include-metrics</code> option lets us choose which metrics we want to show in the table. The <code>--include-params</code> option does the same for hyperparameters. 
This gives us a table that's easier to read quickly.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> </span> ──────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span 
class="token bold"><span class="token hide">**</span>3<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.86885<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.46<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.31573<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3.7067<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>data-change<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> │ ╓ 3b3a2a2 [exp-23593] 3 0.86885 0.46 0.31573 3.7067 0.001 0.09 │ ╟ 93d015d 2 0.83197 0.41333 0.36851 3.4259 0.001 0.09 │ ╟ d474c42 1 0.79918 0.43333 0.46612 3.286 0.001 0.09 ├─╨ 1582b4b 0 0.52869 0.39 0.94102 2.5967 0.001 0.09 </span> ────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>You'll notice that the validation accuracy is really low. 
That's because the training metrics are based on bees and ants while the validation metrics are based on bees, ants, cats, and dogs. If we looked at the validation metrics by class, they'd likely be better for bees and ants than for cats and dogs.</p> <p>That means we should probably add more data to the training dataset.</p> <h2 id="adding-the-cats-data-to-the-training-dataset" style="position:relative;">Adding the cats data to the training dataset<a href="#adding-the-cats-data-to-the-training-dataset" aria-label="adding the cats data to the training dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Let's add the <code>train</code> data for <code>cats</code> to the corresponding directory in <code>data/hymenoptera_data</code> and go through another experiment run with a different learning rate. With this new data, we can run another experiment. One important thing to note here is that we're using checkpoints in our experiments. That's how we get the metrics for each training epoch.</p> <p>If we want to run a fresh experiment that doesn't resume training from the last epoch, we need to reset our experiment. That's what we're going to do with this command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--reset</span></span></code></pre></div> <p>This will reset all of the existing checkpoints and execute the training script. 
Once it's finished, let's take a look at the metrics table with this command. It's the same as the one we ran last time.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span> <span class="token parameter variable">--include-metrics</span> step,acc,val_acc,loss,val_loss <span class="token punctuation">\</span> <span class="token parameter variable">--include-params</span> lr,momentum</span></code></pre></div> <p>Now you'll have a table that shows both experiments and you can see how much better the new one did with the <code>cats</code> data added.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token 
bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> </span> ──────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.91389<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.87<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.20506<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.66306<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>data-change<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token 
hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> │ ╓ 9405575 [exp-54e8a] 3 0.91389 0.87 0.20506 0.66306 0.001 0.09 │ ╟ 856d80f 2 0.90215 0.87333 0.27204 0.61631 0.001 0.09 │ ╟ 23dc98f 1 0.87671 0.86 0.35964 0.61713 0.001 0.09 ├─╨ 99a3c34 0 0.71429 0.82 0.67674 0.62798 0.001 0.09 │ ╓ 3b3a2a2 [exp-23593] 3 0.86885 0.46 0.31573 3.7067 0.001 0.09 │ ╟ 93d015d 2 0.83197 0.41333 0.36851 3.4259 0.001 0.09 │ ╟ d474c42 1 0.79918 0.43333 0.46612 3.286 0.001 0.09 ├─╨ 1582b4b 0 0.52869 0.39 0.94102 2.5967 0.001 0.09 </span> ────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>There's another way you can look at the difference between the model before we added the <code>cats</code> data and after. If you run this in your terminal, you'll get a plot comparing the two experiments.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> exp-23593 exp-54e8a</span></code></pre></div> <p>The <code>exp-23593</code> and <code>exp-54e8a</code> values are the ids for the experiments you want to compare. You'll see a new file gets generated in the <code>dvc_plots</code> directory in your project. That's where you'll find the <code>index.html</code> file you should open in your browser. 
You'll see something similar to this.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/dc01547ad9f771f11e39ba81d45658b3/39600/with-cats-data.png" alt="plots comparing the accuracy, validation accuracy, loss, and validation loss for all epochs of each experiment" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>There's a huge difference in the accuracy of our model after we've added this additional data. Let's see if we can make it even better by adding the <code>dogs</code> data.</p> <h2 id="adding-the-dogs-data-to-the-training-dataset" style="position:relative;">Adding the dogs data to the training dataset<a href="#adding-the-dogs-data-to-the-training-dataset" aria-label="adding the dogs data to the training dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We'll add the <code>train</code> data for <code>dogs</code> to the corresponding directory in <code>data/hymenoptera_data</code> just like we did for the <code>cats</code> data. Now we can run a new experiment with all of the new data included. 
We'll still need to reset the experiment like before, so run the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--reset</span></span></code></pre></div> <p>Once the training epochs are finished, we can take one more look at that metrics table.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span> <span class="token parameter variable">--include-metrics</span> step,acc,val_acc,loss,val_loss <span class="token punctuation">\</span> <span class="token parameter variable">--include-params</span> lr,momentum</span></code></pre></div> <p>Now we'll have all three experiments to compare.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span 
class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> </span> ──────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>3<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.8795<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.90667<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.29302<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.25752<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>data-change<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span 
class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> │ ╓ c20220f [exp-82f70] 3 0.8795 0.90667 0.29302 0.25752 0.001 0.09 │ ╟ fcb5a0b 2 0.85915 0.92333 0.38274 0.25257 0.001 0.09 │ ╟ 3768821 1 0.80751 0.84667 0.47681 0.40228 0.001 0.09 ├─╨ 7e1b8fb 0 0.64632 0.84 0.87301 0.46744 0.001 0.09 │ ╓ 9405575 [exp-54e8a] 3 0.91389 0.87 0.20506 0.66306 0.001 0.09 │ ╟ 856d80f 2 0.90215 0.87333 0.27204 0.61631 0.001 0.09 │ ╟ 23dc98f 1 0.87671 0.86 0.35964 0.61713 0.001 0.09 ├─╨ 99a3c34 0 0.71429 0.82 0.67674 0.62798 0.001 0.09 │ ╓ 3b3a2a2 [exp-23593] 3 0.86885 0.46 0.31573 3.7067 0.001 0.09 │ ╟ 93d015d 2 0.83197 0.41333 0.36851 3.4259 0.001 0.09 │ ╟ d474c42 1 0.79918 0.43333 0.46612 3.286 0.001 0.09 ├─╨ 1582b4b 0 0.52869 0.39 0.94102 2.5967 0.001 0.09 </span> ────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>These results make sense for the experiments we've run. We're paying attention to the validation accuracy here because this gives us a fair comparison of what's happening as we add more data.</p> <p>The first experiment's training metrics are for bees and ants. The second experiment's training metrics are for bees, ants, and cats. And the third experiment's training metrics are for all four classes. 
So we can't really compare these training metrics across experiments.</p> <p>We can look at a comparison between the experiments with the <code>cats</code> data and both the <code>cats</code> and <code>dogs</code> data.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> exp-23593 exp-54e8a exp-82f70</span></code></pre></div> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5dc719adacedff151914e4fb5b634557/39600/with-cats-and-dogs-data.png" alt="plot of differences between model with just cats data and model with both cats and dogs data" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>The results you see line up with what is expected for the validation metrics based on how we added the data to the training set. Now you can keep running experiments until you get your model tuned the way you need it!</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When you want to change datasets quickly and start tracking how they affect your model, using a DVC remote makes it easy to do so on different computers. 
You'll be able to quickly upload and download GBs of data and see how changes affect individual experiments.</p> <p>If you need help with anything DVC or CML, make sure to <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">join our Discord community</a>! We're always answering questions and having good conversations with everybody that shows up.</p>https://dvc.org/blog/september-21-community-gemshttps://dvc.org/blog/september-21-community-gemsThu, 30 Sep 2021 00:00:00 GMT<h3 id="is-there-a-way-to-share-data-across-multiple-on-premise-machines-so-that-users-can-train-models-individually" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/879718738163826698" target="_blank" rel="nofollow noopener noreferrer">Is there a way to share data across multiple on-premise machines so that users can train models individually?</a><a href="#is-there-a-way-to-share-data-across-multiple-on-premise-machines-so-that-users-can-train-models-individually" aria-label="is there a way to share data across multiple on premise machines so that users can train models individually permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a good scenario to try out one of these use cases:</p> <ul> <li><a href="https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache" target="_blank" rel="nofollow noopener noreferrer">Configuring a DVC cache</a></li> <li><a href="https://dvc.org/doc/use-cases/fast-data-caching-hub" target="_blank" rel="nofollow noopener noreferrer">Sharing a development 
server</a></li> </ul> <p>You can have a single storage location mounted on each workstation to serve as a central cache.</p> <p>That way all of your machine learning engineers can work with the same data in a central location.</p> <p>Thanks for the question @fchpriani!</p> <h3 id="if-we-change-the-remote-we-are-using-in-our-workspace-does-that-effect-where-dvc-pulls-and-pushes-data-to-for-all-historical-commits" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/882951655979622400" target="_blank" rel="nofollow noopener noreferrer">If we change the remote we are using in our workspace, does that effect where DVC pulls and pushes data to for all historical commits?</a><a href="#if-we-change-the-remote-we-are-using-in-our-workspace-does-that-effect-where-dvc-pulls-and-pushes-data-to-for-all-historical-commits" aria-label="if we change the remote we are using in our workspace does that effect where dvc pulls and pushes data to for all historical commits permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for bringing this up @mattlbeck!</p> <p>Right now DVC just uses whichever remote is configured in a respective commit that you've checked out.</p> <p>To clarify things a bit more, if you run <code>dvc push/pull</code> in a workspace with a new remote, that new remote will be used for <code>--all-branches</code>, <code>--all-tags</code>, and <code>--all-commits</code>.</p> <h3 id="is-there-a-command-to-execute-only-a-few-specific-stages-in-a-dvc-pipeline" style="position:relative;"><a 
href="https://discord.com/channels/485586884165107732/485596304961962003/888054401640562698" target="_blank" rel="nofollow noopener noreferrer">Is there a command to execute only a few specific stages in a DVC pipeline?</a><a href="#is-there-a-command-to-execute-only-a-few-specific-stages-in-a-dvc-pipeline" aria-label="is there a command to execute only a few specific stages in a dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can freeze the stages that you do not want to be executed.</p> <p><a href="https://dvc.org/doc/command-reference/freeze"><code>dvc freeze</code></a> and <a href="https://dvc.org/doc/command-reference/unfreeze"><code>dvc unfreeze</code></a> help you do this. 
Or you can use <a href="https://dvc.org/doc/command-reference/repro#--glob"><code>dvc repro --glob pattern*</code></a> together with <code>-s</code> to match the stages you want to run.</p> <p>Thanks for the question @LucZ!</p> <h3 id="when-running-queued-experiments-is-it-expected-for-dvc-to-run-dvc-checkout-for-each-experiment" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/883144885417431081" target="_blank" rel="nofollow noopener noreferrer">When running queued experiments, is it expected for DVC to run <code>dvc checkout</code> for each experiment?</a><a href="#when-running-queued-experiments-is-it-expected-for-dvc-to-run-dvc-checkout-for-each-experiment" aria-label="when running queued experiments is it expected for dvc to run dvc checkout for each experiment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This brings up a good point, so thanks @dmh!</p> <p>If you usually run experiments with <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, you'll notice that it doesn't checkout any files. That's because the experiment is running in the current workspace.</p> <p>When you use <a href="https://dvc.org/doc/command-reference/exp/run#--queue"><code>dvc exp run --queue</code></a> or <a href="https://dvc.org/doc/command-reference/exp/run#--run-all"><code>dvc exp run --run-all</code></a>, it runs each experiment in its own separate temp workspace, so files have to be checked out into those workspaces. 
Check out the notes in <a href="https://dvc.org/doc/command-reference/exp/run#queueing-and-parallel-execution" target="_blank" rel="nofollow noopener noreferrer">this reference doc on queueing and parallel execution</a> for more details.</p> <h3 id="when-working-with-a-data-registry-is-it-possible-to-pull-a-specific-project-folder-modify-it-then-push-git-changes-and-dvc-push-to-the-remote-storage-without-pulling-data-from-all-the-directories" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/887427010044002345" target="_blank" rel="nofollow noopener noreferrer">When working with a data registry, is it possible to pull a specific project folder, modify it, then push Git changes and <code>dvc push</code> to the remote storage without pulling data from all the directories?</a><a href="#when-working-with-a-data-registry-is-it-possible-to-pull-a-specific-project-folder-modify-it-then-push-git-changes-and-dvc-push-to-the-remote-storage-without-pulling-data-from-all-the-directories" aria-label="when working with a data registry is it possible to pull a specific project folder modify it then push git changes and dvc push to the remote storage without pulling data from all the directories permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is definitely possible. The most common way to handle this is by working in the specific folder. 
You can <a href="https://dvc.org/doc/command-reference/pull#-R"><code>dvc pull -R</code></a> from the sub-directory, then make your changes in the sub-directory, and <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> the changes. Then you can do a <code>git commit</code> and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> to manage those changes.</p> <p>You can also use a Git sub-repo and a DVC sub-repo to do this if each folder has a distinct project. Use <code>git init</code> and <a href="https://dvc.org/doc/command-reference/init"><code>dvc init</code></a> in the project folders, and then you can pull them down, modify them, and commit and push the changes back.</p> <p>Really good question @ross.tsenov!</p> <h3 id="is-it-possible-to-auto-generate-reports-with-metrics-and-plots-by-running-dvc-in-a-cml-job-when-the-data-is-stored-in-aws-bucket-instead-of-github" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/877072469188575262" target="_blank" rel="nofollow noopener noreferrer">Is it possible to auto-generate reports with metrics and plots by running DVC in a CML job when the data is stored in AWS bucket instead of GitHub?</a><a href="#is-it-possible-to-auto-generate-reports-with-metrics-and-plots-by-running-dvc-in-a-cml-job-when-the-data-is-stored-in-aws-bucket-instead-of-github" aria-label="is it possible to auto generate reports with metrics and plots by running dvc in a cml job when the data is stored in aws bucket instead of github permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z"></path></svg> </a></h3> <p>Thanks for asking @Masmoudi!</p> <p>When you need to retrieve data, you can run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> to get it from the S3 bucket. If you run into an error with this, try adding a <code>uses: iterative/setup-dvc@v1</code> step to your workflow before the step that runs the <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> command. The error can happen because the default CML action doesn't install DVC.</p> <p>If you want more details on how CML works in GitHub, check out <a href="https://cml.dev/doc/start/github#the-cml-github-action" target="_blank" rel="nofollow noopener noreferrer">the docs</a>!</p> <h3 id="what-mechanism-can-i-use-in-gitlab-to-trigger-a-ci-pipeline-periodically-so-that-models-get-re-trained-and-logged-to-dvc-automatically" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/887306645883990037" target="_blank" rel="nofollow noopener noreferrer">What mechanism can I use in GitLab to trigger a CI pipeline periodically so that models get re-trained and logged to DVC automatically?</a><a href="#what-mechanism-can-i-use-in-gitlab-to-trigger-a-ci-pipeline-periodically-so-that-models-get-re-trained-and-logged-to-dvc-automatically" aria-label="what mechanism can i use in gitlab to trigger a ci pipeline periodically so that models get re trained and logged to dvc automatically permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can use <a href="https://docs.gitlab.com/ee/ci/pipelines/schedules.html" target="_blank" 
rel="nofollow noopener noreferrer">pipeline schedules</a> to train your model periodically and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> the results.</p> <p>Good question @mihaj!</p> <hr> <p><img src="https://media.giphy.com/media/8UF0EXzsc0Ckg/giphy.gif" alt="Its Over GIF"></p> <p>At our October Office Hours Meetup we will be going over how to get started with data version control. <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280814318/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/refactorhttps://dvc.org/blog/refactorFri, 24 Sep 2021 00:00:00 GMT<p>It is common for big codebases to grow to a complexity where it is nearly impossible for someone to tediously and flawlessly refactor things manually everywhere. The main problem with existing automated solutions (such as regex-based find-and-replace tools) is that they treat source code like a plain text document. This often results in false positives (tools making changes when they shouldn't) and/or false negatives (not changing what they should). This is primarily due to a lack of ability to truly encapsulate structural concepts of the programming language: syntax and grammar that are impossible to manifest in regexes.</p> <p>This is where <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree" target="_blank" rel="nofollow noopener noreferrer">AST</a>s shine. They are the common building blocks of source code; produced by a parser that actually understands the language's syntax and creates a tree object where smaller parts (e.g. 
tokens) are ordered in a way that they are related by their syntactical meanings.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">password <span class="token operator">=</span> <span class="token builtin">input</span><span class="token punctuation">(</span><span class="token string">"password? "</span><span class="token punctuation">)</span>
<span class="token keyword">if</span> password <span class="token operator">==</span> secrets<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"my_password"</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"correct"</span><span class="token punctuation">)</span>
<span class="token keyword">else</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"incorrect"</span><span class="token punctuation">)</span></code></pre></div> <p>For example, the AST for the code above will look like this:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 592.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6ec34840bbe994244ffbd74bb2b65984/39600/ast.png" alt="Abstract Syntax Tree" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Abstract Syntax Tree</em></p> <p>The top-most "root" node of this tree represents a single Python file. Each file consists of a number of statements (e.g. function definitions, loops, etc.). For our example we have only 2 statements: an assignment (to <code>password</code>), and an <code>if</code> statement. 
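</p> <p>As a quick check of that structure, here is a minimal sketch (not part of the original walkthrough) that parses the same snippet with Python's standard <code>ast</code> module and lists the top-level statements under the root node. The variable names <code>source</code>, <code>tree</code> and <code>top_level</code> are illustrative assumptions:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import ast

# The same snippet as above, kept as a string so that it is only parsed, never executed.
source = '''
password = input("password? ")
if password == secrets.get("my_password"):
    print("correct")
else:
    print("incorrect")
'''

# ast.parse() returns the root node (ast.Module), which represents the whole file.
tree = ast.parse(source)

# tree.body holds the top-level statements in source order.
top_level = [type(stmt).__name__ for stmt in tree.body]
print(top_level)  # ['Assign', 'If']</code></pre></div> <p>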
Each of these statements in turn has child nodes as defined by <a href="https://docs.python.org/3/library/ast.html#abstract-grammar" target="_blank" rel="nofollow noopener noreferrer">Python's ASDL</a>.</p> <h2 id="refactoring-source-code-through-asts" style="position:relative;">Refactoring source code through ASTs<a href="#refactoring-source-code-through-asts" aria-label="refactoring source code through asts permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://github.com/isidentical/refactor" target="_blank" rel="nofollow noopener noreferrer">Refactor</a> simplifies the process of matching ASTs. 
It then applies your transformations to these ASTs without touching the other parts of your source code.</p> <p>For example, consider this code:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">foo <span class="token operator">=</span> <span class="token punctuation">[</span>
    <span class="token number">1</span><span class="token punctuation">,</span>
    <span class="token number">2</span>
<span class="token punctuation">]</span>

foo_2 <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token string">'a'</span><span class="token punctuation">,</span> <span class="token operator">*</span>foo<span class="token punctuation">]</span>

<span class="token keyword">if</span> foo<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">>=</span> <span class="token number">1</span><span class="token punctuation">:</span>
    <span class="token keyword">assert</span> secrets<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">"foo"</span><span class="token punctuation">)</span> <span class="token operator">==</span> foo</code></pre></div> <p>As a simple example, let's try to find and replace all instances of the <code>foo</code> variable with <code>bar</code>… but without changing things inside strings or partial matches like <code>foo_2</code>.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> ast
<span class="token keyword">import</span> refactor</code></pre></div> <p>The first thing we need to do is define a rule. 
Each rule is a class that defines a single entrypoint (<code>match()</code>), takes AST nodes from the tree, and either rejects them (by raising an <code>AssertionError</code> or just returning <code>None</code>) or accepts them (by returning a <code>refactor.Action</code>).</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">class</span> <span class="token class-name">ReplaceFoo</span><span class="token punctuation">(</span>refactor<span class="token punctuation">.</span>Rule<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">def</span> <span class="token function">match</span><span class="token punctuation">(</span>self<span class="token punctuation">,</span> node<span class="token punctuation">)</span><span class="token punctuation">:</span></code></pre></div> <p>Next, in the <code>match()</code> method, we will look for <code>Name</code> nodes (which is what the actual identifier is wrapped in), and check whether the node's <code>id</code> is <code>foo</code>.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">        <span class="token keyword">assert</span> <span class="token builtin">isinstance</span><span class="token punctuation">(</span>node<span class="token punctuation">,</span> ast<span class="token punctuation">.</span>Name<span class="token punctuation">)</span>
        <span class="token keyword">assert</span> node<span class="token punctuation">.</span><span class="token builtin">id</span> <span class="token operator">==</span> <span class="token string">"foo"</span></code></pre></div> <p>If any of these assertions fail, the function will terminate and the engine will move to the next <code>node</code> in the tree. But if we have a match, we need to return some sort of an action. 
The simplest thing we can return is a <code>refactor.ReplacementAction</code>, which takes this <code>node</code> and replaces it with the given argument.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">        <span class="token keyword">return</span> refactor<span class="token punctuation">.</span>ReplacementAction<span class="token punctuation">(</span>
            node<span class="token punctuation">,</span>
            ast<span class="token punctuation">.</span>Name<span class="token punctuation">(</span><span class="token string">"bar"</span><span class="token punctuation">,</span> node<span class="token punctuation">.</span>ctx<span class="token punctuation">)</span>
        <span class="token punctuation">)</span></code></pre></div> <p>And that's it! To run this refactoring, we can simply create a CLI application from our rules via <code>refactor.run()</code>:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    refactor<span class="token punctuation">.</span>run<span class="token punctuation">(</span>rules<span class="token operator">=</span><span class="token punctuation">[</span>ReplaceFoo<span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div> <p>If we run it on the file above, we will get this <code>diff</code>:</p> <div class="gatsby-highlight" data-language="diff"><pre class="language-diff"><code class="language-diff"><span class="token coord">@@ -1,9 +1,9 @@</span>
<span class="token deleted-sign deleted"><span class="token prefix deleted">-</span>foo = [
</span><span class="token inserted-sign inserted"><span class="token prefix inserted">+</span>bar = [
</span><span class="token unchanged"><span class="token prefix unchanged"> </span>    1,
<span class="token prefix unchanged"> </span>    2
<span class="token prefix unchanged"> </span>]
</span>
<span class="token deleted-sign deleted"><span class="token prefix deleted">-</span>foo_2 = ['a', *foo]
</span><span class="token inserted-sign inserted"><span class="token prefix inserted">+</span>foo_2 = ['a', *bar]
</span>
<span class="token deleted-sign deleted"><span class="token prefix deleted">-</span>if foo[0] >= 1:
<span class="token prefix deleted">-</span>    assert secrets.get("foo") == foo
</span><span class="token inserted-sign inserted"><span class="token prefix inserted">+</span>if bar[0] >= 1:
<span class="token prefix inserted">+</span>    assert secrets.get("foo") == bar</span></code></pre></div> <p>All instances of the <code>foo</code> variable have been replaced, but items like <code>foo_2</code> and <code>"foo"</code> are left alone as expected!</p> <h2 id="going-deeper" style="position:relative;">Going Deeper<a href="#going-deeper" aria-label="going deeper permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Obviously not all refactorings are as simple as this, so <code>refactor</code> is equipped with more features like different actions, observers and representatives for context management. 
If you are curious about these and more advanced features, be sure to check out the <a href="https://refactor.readthedocs.io/en/latest" target="_blank" rel="nofollow noopener noreferrer"><code>refactor</code> documentation</a>!</p>https://dvc.org/blog/september-21-dvc-heartbeathttps://dvc.org/blog/september-21-dvc-heartbeatTue, 14 Sep 2021 00:00:00 GMT<h1 id="this-months-head-turning-news-from-the-community" style="position:relative;">This month's head-turning News from the Community!<a href="#this-months-head-turning-news-from-the-community" aria-label="this months head turning news from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/1hWHUCgi3wKT6/giphy.gif?cid=ecf05e47a5sz6kvyp4h1swih08yokkbdfr39pq9pxscg975u&rid=giphy.gif&ct=g" alt="Head Turning Content from the DVC Community!"></p> <h3 id="tezan-sahus-4-part-blog-series" style="position:relative;">Tezan Sahu's 4-part blog series<a href="#tezan-sahus-4-part-blog-series" aria-label="tezan sahus 4 part blog series permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Welcome to September! 
We'll kick off this month's Community picks with a four-part series by <a href="https://twitter.com/SahuTezan" target="_blank" rel="nofollow noopener noreferrer"><strong>Tezan Sahu</strong></a> on the <a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-1-a-gentle-introduction-to-mlops-1b184d2c32a8" target="_blank" rel="nofollow noopener noreferrer"><strong>Fundamentals of MLOps.</strong></a> Tezan introduces readers to the core ideas behind DevOps best practices and how they are being adapted to machine learning projects that deploy large-scale, AI-powered applications. The series includes:</p> <ul> <li>Part 1: <a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-1-a-gentle-introduction-to-mlops-1b184d2c32a8" target="_blank" rel="nofollow noopener noreferrer">A Gentle Introduction to MLOps</a></li> <li>Part 2: <a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-2-data-model-management-with-dvc-6be2ad284ec4" target="_blank" rel="nofollow noopener noreferrer">Data & Model Management with DVC</a> We love this part best! 
❤️😉</li> <li>Part 3: <a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-3-ml-experimentation-using-pycaret-747f14e4c28d" target="_blank" rel="nofollow noopener noreferrer">ML Experimentation with PyCaret</a></li> <li>Part 4: <a href="https://tezansahu.medium.com/fundamentals-of-mlops-part-4-tracking-with-mlflow-deployment-with-fastapi-61614115436" target="_blank" rel="nofollow noopener noreferrer">Tracking with MLFlow & Deployment with FastAPI</a></li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f7a737bfd6e8b7a186ba2775d773d571/39600/tezan-sahu.png" alt="Fundamentals of MLOps" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Tezan Sahu's 4-part series on the Fundamentals of MLOps <a href="https://ljvmiranda921.github.io/notebook/2021/07/30/data-centric-ml/" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <p>If you follow the steps through this series, you will learn how to build and deploy an end-to-end ML project - all the steps leading to production!</p> <h3 id="miguel-méndez-tutorial-on-dvc--mmdetection" style="position:relative;">Miguel Méndez' Tutorial on DVC + MMdetection<a href="#miguel-m%C3%A9ndez-tutorial-on-dvc--mmdetection" aria-label="miguel méndez tutorial on dvc mmdetection permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This month <a href="https://www.linkedin.com/in/miguel-mendez/" 
target="_blank" rel="nofollow noopener noreferrer">Miguel Méndez</a> of <a href="https://www.gradiant.org/en//" target="_blank" rel="nofollow noopener noreferrer">Gradiant</a> brings us a guide on object detection using the <a href="">MMdetection</a> framework in conjunction with DVC to design the pipeline, version models and monitor training progress. This follows his <a href="https://mmeendez8.github.io/2021/07/01/dvc-tutorial.html" target="_blank" rel="nofollow noopener noreferrer">first guide</a> covering how to version your datasets with DVC, which we shared in the <a href="https://dvc.org/blog/july-21-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">July Heartbeat.</a></p> <p>In <a href="https://mmeendez8.github.io/2021/08/30/mmdet-dvc-tutorial.html" target="_blank" rel="nofollow noopener noreferrer">this new guide,</a> you'll gain a thorough understanding of the steps, have access to <a href="https://github.com/mmeendez8/mmdetection_dvc" target="_blank" rel="nofollow noopener noreferrer">his repo</a> for the project, and find his thoughts on scaling hyperparameter tuning through this <a href="https://github.com/iterative/dvc/issues/5477#issuecomment-905440724" target="_blank" rel="nofollow noopener noreferrer">open issue</a> about experiments that we are trying to resolve. Join the conversation! 
We'd love your input!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4fac790212019b4e53a1735ed91feb92/39600/mmdetection.png" alt="DVC + MMdetection" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Miguel Méndez' second guide in a series using DVC in an object detection project <a href="https://mmeendez8.github.io/2021/08/30/mmdet-dvc-tutorial.html" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h2 id="hrittik-roys-complete-intro-to-dvc" style="position:relative;">Hrittik Roy's Complete Intro to DVC<a href="#hrittik-roys-complete-intro-to-dvc" aria-label="hrittik roys complete intro to dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>It was just a few short months ago when <a href="">Hrittik Roy</a> joined us at his first <a href="">DVC Office Hours</a>. Now he's written <a href="https://dev.to/hrittikhere/dvc-git-for-data-a-complete-intro-4626" target="_blank" rel="nofollow noopener noreferrer">DVC (Git for Data): A Complete Intro</a>, covering DVC and how it solves the challenges ML engineers face. 
In this piece he takes you through setup, pipelines and versioning, experiments, and sharing through our built-in shared caching, so that you and your teammates can reduce resource use when focusing on a subset of datasets as you move through your project.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d38ded232422aedc9d34369951b99b33/39600/hrittik-roy.png" alt="DVC (Git for Data): A complete Intro" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Hrittik Roy's Complete Intro on DVC <a href="https://dev.to/hrittikhere/dvc-git-for-data-a-complete-intro-4626" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h2 id="andrey-kurenkovs-curated-list-of-ai-newsletters" style="position:relative;">Andrey Kurenkov's curated list of AI Newsletters<a href="#andrey-kurenkovs-curated-list-of-ai-newsletters" aria-label="andrey kurenkovs curated list of ai newsletters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In case you missed it, <a href="https://twitter.com/andrey_kurenkov?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor" target="_blank" rel="nofollow noopener noreferrer">Andrey Kurenkov</a> tweeted that he finally got around to writing about his list of 21 favorite AI Newsletters. 
You can find the article <a href="https://medium.com/@andreykurenkov/the-best-ai-newsletters-483dc75134b" target="_blank" rel="nofollow noopener noreferrer">right here.</a> Be sure to check it out and get reading…</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279723437/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">One PhD student’s curated list of 21 newsletters to help you keep up with AI news and research</h4> <div class="elp-description">Andrey Kurenkov's curated list of the best AI newsletters</div> <div class="elp-link">https://medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-09-14/andrey-wordcloud-021ffe734cdcce52fa574effb88fb851.png" alt="One PhD student’s curated list of 21 newsletters to help you keep up with AI news and research"> </div> </a> </section> <p></p> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>We know there were a lot of peeps out on holiday over the last month so let me fill you in!</p> <p><img src="https://media.giphy.com/media/lz7212bWGdZbkm30KJ/giphy.gif?cid=ecf05e47hg6at9zmqb1pglypfrzi6vrgdsbay6zgza7wmwwu&rid=giphy.gif&ct=g" alt="Grab the popcorn!"></p> <h2 id="yes-thats-right-vs-code-extension-is-coming" style="position:relative;">Yes, that's right, VS Code extension is coming!<a href="#yes-thats-right-vs-code-extension-is-coming" 
aria-label="yes thats right vs code extension is coming permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/DynamicWebPaige" target="_blank" rel="nofollow noopener noreferrer">Paige Bailey</a> let the cat out of the bag <a href="https://twitter.com/DynamicWebPaige/status/1430920240251035649?s=20" target="_blank" rel="nofollow noopener noreferrer">with her tweet</a> about the development of our VS Code extension for DVC. We're getting closer every day! If you'd like to be a part of the beta testing (how could you not?) <a href="https://t.co/F64H9yyDH9?amp=1" target="_blank" rel="nofollow noopener noreferrer">join us here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 468px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c04073dc47a6902eb849b9b44ffab032/39600/VSCode.png" alt="VS Code Extension for DVC" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Paige Bailey lets the cat out of the bag <a href="https://twitter.com/DynamicWebPaige/status/1430920240251035649?s=20" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h2 id="-docs-updates" style="position:relative;">📖 Docs Updates<a href="#-docs-updates" aria-label=" docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As promised, we will be adding this section to the Heartbeat each month so that you can stay in the know about the doc updates that will most impact your workflows. You won't want to miss these…</p> <h3 id="-fast-and-secure-data-caching-hub" style="position:relative;">📖 Fast and Secure Data Caching Hub<a href="#-fast-and-secure-data-caching-hub" aria-label=" fast and secure data caching hub permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>First up, a new doc on our <a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#fast-and-secure-data-caching-hub" target="_blank" rel="nofollow noopener noreferrer">Fast and Secure Data Caching Hub.</a> Check out this doc to learn how DVC's built-in data caching lets you implement a simple and efficient storage layer globally - FOR YOUR ENTIRE TEAM. This lets you:</p> <ul> <li>⏱ Speed data transfers from massive object stores currently on the cloud</li> <li>💰 Pay only for fast access to frequently-used data</li> <li>🙅🏻‍♂️ Avoid extra downloads and duplicating data</li> <li>⚡️ Switch data inputs fast (without re-downloading) on a shared server used for machine learning experiments.</li> </ul> <p>Status: Must read. 
📖</p> <p><img src="https://dvc.org/2021-09-14/fcaching-57a95a2297f0fbd38a2625ae0177046b.gif" alt="Fast and Secure Data Caching Hub"> <em>Fast and Secure Data Caching Hub <a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#fast-and-secure-data-caching-hub" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h3 id="-cicd-for-machine-learning" style="position:relative;">📖 CI/CD for Machine Learning<a href="#-cicd-for-machine-learning" aria-label=" cicd for machine learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Is this your life?</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 612px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/101349ee0d2416d578b584e83f12ae55/39600/cicd4ml-0.png" alt="Rage Quit Job" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Is this your life? 
<a href="https://dvc.org/doc/use-cases/ci-cd-for-machine-learning#continuous-integration-and-deployment-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <p>Our latest doc, <a href="https://dvc.org/doc/use-cases/ci-cd-for-machine-learning#continuous-integration-and-deployment-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">Continuous Integration and Deployment for Machine Learning,</a> shows you how to move from the above chaos to CI/CD victory through:</p> <ul> <li>✅ Data validation</li> <li>✅ Model validation</li> <li>🎟 Provisioning</li> <li>📈 Metrics</li> </ul> <p>Read the whole doc to learn how DVC and CML will enable you to run entire experiments/research online and remove most of your management headaches, so that your workflow looks more like this. 👇🏼</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 561px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1b000d95083794b7ebfb4e8b901881f1/39600/cicd4ml-1.png" alt="Traditional ML meets CI/CD" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Traditional ML meets CI/CD with DVC and CML <a href="https://dvc.org/doc/use-cases/ci-cd-for-machine-learning#continuous-integration-and-deployment-for-machine-learning" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h3 id="-need-to-clean-up-your-worksapce" style="position:relative;">📖 Need to Clean up Your Workspace?<a href="#-need-to-clean-up-your-worksapce" aria-label=" need to clean up your worksapce permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 
0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://dvc.org/doc/user-guide/experiment-management/cleaning-experiments" target="_blank" rel="nofollow noopener noreferrer">Cleaning Up Experiments</a> has been made bright, shiny, and new so that you can do the same with your experiments. Be sure to check it out!</p> <h3 id="-hugging-face-integration-with-dvc-live" style="position:relative;">📖 Hugging Face Integration with DVC Live<a href="#-hugging-face-integration-with-dvc-live" aria-label=" hugging face integration with dvc live permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://huggingface.co/" target="_blank" rel="nofollow noopener noreferrer">Hugging Face</a> fans now have an integration with DVCLive! Check out how to <a href="https://dvc.org/doc/dvclive/api-reference/ml-frameworks/huggingface" target="_blank" rel="nofollow noopener noreferrer">get set up here!</a> Thanks <a href="https://github.com/pacifikus" target="_blank" rel="nofollow noopener noreferrer">@pacifikus</a>, for the contribution! 
🙏🏼</p> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This Thursday at our <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280212578/" target="_blank" rel="nofollow noopener noreferrer">September Office Hours Meetup</a>, <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer">Milecia McGregor</a> will be presenting her tutorial on <a href="https://dvc.org/blog/transfer-learning-experiments" target="_blank" rel="nofollow noopener noreferrer">Using Experiments For Transfer Learning.</a> Join us on September 16th at 3:00 pm UTC! RSVP at the link below! 
👇🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/280212578/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVC Office Hours - Using Experiments For Transfer Learning</h4> <div class="elp-description">Milecia McGregor shows how to use DVC experiment tracking to compare models in a transfer learning project</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-09-14/pretrained-models-67709ed24c45932295bf5818741399d6.png" alt="DVC Office Hours - Using Experiments For Transfer Learning"> </div> </a> </section> <p></p> <h2 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our August Meetup video is out, so if you weren't able to make it, you can catch all the details on <a href="https://twitter.com/AntoineToubhans" target="_blank" rel="nofollow noopener noreferrer">Antoine Toubhan's</a> tutorial on <a href="https://www.sicara.ai/blog/dvc-streamlit-webui-ml" target="_blank" rel="nofollow noopener noreferrer">DVC + Streamlit = ❤️</a></p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/F318uN01v7M?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div 
class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We'll be introducing some new team members next month, but we are still hiring. So do check out our open positions <a href="https://www.notion.so/iterative/iterative-ai-is-hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">here</a> to find details of all the positions including:</p> <ul> <li>Senior Front-End Engineer (TypeScript, Node, React)</li> <li>Senior Software Engineer (ML, Dev Tools, Python)</li> <li>Senior Software Engineer (ML, Data Infra, GoLang)</li> <li>Machine Learning Engineer/Field Data Scientist</li> <li>Developer Advocate (ML)</li> <li>Director/VP of Engineering (ML, DevTools)</li> <li>Director/VP of Product (ML, Data Infra, SaaS)</li> <li>Director/VP of Operations/Chief of Staff</li> </ul> <p>Please pass this info on to anyone you know that may fit the bill. We look forward to new team members! 
🎉</p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Last week this Tweet brought us another 300 Twitter followers, catapulting us over 3000! Thanks Community for joining us on this MLOps ride! More to come! 🚀</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Startups I'm *incredibly* bullish about: <a href="https://twitter.com/stripe">@Stripe</a>, <a href="https://twitter.com/Iterativeai">@IterativeAI</a>, <a href="https://twitter.com/huggingface">@HuggingFace</a>, and <a href="https://twitter.com/explosion_ai">@Explosion_AI</a>.<br><br>If you're an engineer/PM considering a career change (and it's that time of the year again, no? 😆)—but want to opt away from FAAMG, definitely consider one of the companies above.</p>— 👩‍💻 Paige Bailey (@DynamicWebPaige) <a href="https://twitter.com/DynamicWebPaige/status/1435256826375720964">September 7, 2021</a></blockquote> <hr> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/road-to-hellhttps://dvc.org/blog/road-to-hellTue, 07 Sep 2021 00:00:00 GMT<p>Machine learning operations (MLOps) has emerged in the last year as a distinct IT discipline for building machine learning (ML) or artificial intelligence (AI) models. While at first blush that may seem like a viable method for automating the building of AI models, in reality purveyors of MLOps platforms have a vested interest in convincing organizations to acquire platforms that exist outside of best DevOps practices that have already been proven to accelerate application development.</p> <p>AI models, however, are ultimately a software artifact like any other that needs to be integrated within an application. The trouble with MLOps as it is most often pursued today is that data scientists are constructing AI models in almost complete isolation from the rest of the organization. The hope is that somehow when the AI model is completed it will magically be incorporated into an application development workflow. Unfortunately for all concerned, the rate at which applications are being developed using best DevOps practices rarely aligns with the rate at which AI models are being constructed.</p> <blockquote> <p>"The trouble with MLOps as it is most often pursued today is that data scientists are constructing AI models in almost complete isolation from the rest of the organization."</p> </blockquote> <p>The result is not only a lot of wasted time and effort, the rate at which digital business transformation initiatives that depend on AI models are rolled out becomes a significant competitive disadvantage. 
In effect, the road to AI hell is paved with good MLOps intentions.</p> <p>While working as a data scientist at Microsoft, I saw firsthand how machine learning and AI were traditionally implemented in an isolated fashion. That unsatisfactory experience led to the launch of open source Data Version Control (DVC) and Continuous Machine Learning (CML) tools that integrate ML workflows into best practices for software development. Instead of creating a separate proprietary AI platform that needs to be acquired and maintained, the goal needs to be to extend traditional software tools such as Git, collaboration and continuous integration/continuous delivery (CI/CD) platforms to meet the needs of both developers and ML engineers. The entire ML stack needs to be reinvented in a way that makes it accessible to every developer.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fd9d4dc199039488512e2fb94d4bd300/39600/dvc-studio.png" alt="DVC Studio" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>DVC and CML are open source tools that, now along with DVC Studio, streamline the workflow of data scientists. They integrate ML workflows into current practices for software development in a way that eliminates the need for many features of proprietary AI platforms such as AWS SageMaker, Microsoft Azure ML and Google Vertex AI by extending traditional software tools like Git and CI/CD platforms to meet the needs of ML researchers and ML engineers. In essence, they provide an open platform based on best DevOps practices to operationalize ML and AI.</p> <blockquote> <p>"DVC and CML are open source tools that streamline the workflow of data scientists. 
They integrate ML workflows into current practices for software development in a way that eliminates the need for many features of proprietary AI platforms such as AWS SageMaker, Microsoft Azure ML and Google Vertex AI by extending traditional software tools like Git and CI/CD platforms to meet the needs of ML researchers and ML engineers."</p> </blockquote> <p>MLOps is about operations and automation for ML and AI. It covers the entire lifecycle of an ML process including labeling data, development, modeling, and monitoring. Every ML/AI platform offers this functionality. However, our vision for MLOps is different. We think it should be embedded within your DevOps processes. It should be part of your engineering infrastructure, engineering stack and engineering processes. ML requires additional tools. It’s just that those tools need to be incorporated into a larger toolchain.</p> <p>The primary reason to do this is to interact more consistently with people from the software engineering side and to reuse proven tools such as Git, GitHub/GitLab and CI/CD systems. An ML silo that builds an AI model outside the traditional application development process creates a divide that needs to be bridged whenever a data scientist needs to collaborate with engineers. For example, with a traditional AI platform, all the workflows are predefined. There may be some opportunity to modify them, but for all intents and purposes, those workflows are inflexible. That’s the wrong approach. Teams made up of data scientists and developers should be able to define their own workflow based on their business requirements and team preferences, just like they do today when constructing any other software artifact. 
Rather than a platform forcing teams to embrace a highly opinionated workflow, they can employ flexible tools such as Git, GitHub, and their existing CI tools as they see fit.</p> <blockquote> <p>"Teams made up of data scientists and developers should be able to define their own workflow based on their business requirements and team preferences, just like they do today when constructing any other software artifact."</p> </blockquote> <h2 id="how-we-do-it" style="position:relative;">How We Do It<a href="#how-we-do-it" aria-label="how we do it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When it comes to software engineering, everything in a workflow is based on the version of the artifact. However, when working with large data sets, that approach doesn’t work because there is no data versioning with existing tools. We extend existing DevOps tools so that developers can version data and ML models in addition to code.</p> <p>In addition to allowing for data and model versioning, we also align data scientists to the CI/CD process. This enables the data scientist to share code and data with other members of the team in a way that actually works on their machines! That’s critical because code is typically run through a third-party platform to determine if it will run in a production environment. There is no way to bring data into this process, which means there’s no real way to determine whether a model works before deploying it. There are no ways to show metrics. There are no ways to compare your metrics with your production metrics. 
In this scenario, everything needs to be instrumented to attach required plots to test. That takes a lot of time. We enable multiple plot points to be tested. Finally, we provide a place to visualize and analyze data other than employing Microsoft Excel spreadsheets. We extend traditional software engineering functionality by providing a better system to visualize data right on top of your GitHub, GitLab or BitBucket user interface.</p> <blockquote> <p>"We believe an open source-based workflow based on version control and CI tools will streamline machine learning in the same way software development has already been modernized."</p> </blockquote> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We believe an open source-based workflow based on version control and CI tools will streamline machine learning in the same way software development has already been modernized. 
If data scientists, engineers and developers can accelerate the development of ML/AI models by reusing files, pipelines, experiments and even entire models stored in a Git repository, the rate at which AI will be infused into software will increase by several orders of magnitude and, best of all, the road to AI hell is not taken.</p> <hr> <p><em>This post originally appeared in</em> <a href="https://thenewstack.io/the-road-to-ai-hell-starts-with-good-mlops-intentions/" target="_blank" rel="nofollow noopener noreferrer">The New Stack.</a></p>https://dvc.org/blog/august-21-community-gemshttps://dvc.org/blog/august-21-community-gemsTue, 31 Aug 2021 00:00:00 GMT<h3 id="q-are-toml-files-supported-for-storing-model-metrics-and-displaying-them-via-dvc-metrics-show" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/865974923079319563" target="_blank" rel="nofollow noopener noreferrer">Q: Are TOML files supported for storing model metrics and displaying them via <code>dvc metrics show</code>?</a><a href="#q-are-toml-files-supported-for-storing-model-metrics-and-displaying-them-via-dvc-metrics-show" aria-label="q are toml files supported for storing model metrics and displaying them via dvc metrics show permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the question @naeljaneLiblikas!</p> <p>DVC does not support TOML files for metrics. 
TOML files are supported for parameters only at the moment.</p> <p>We do have an <a href="https://github.com/iterative/dvc/issues/6402" target="_blank" rel="nofollow noopener noreferrer">open issue</a> for this. Please feel free to add any comments or emojis to this issue so we know how to prioritize it!</p> <h3 id="q-is-there-a-way-to-store-the-results-of-the-experiments-table-in-a-csv-file" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/872554861340803092" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to store the results of the experiments table in a CSV file?</a><a href="#q-is-there-a-way-to-store-the-results-of-the-experiments-table-in-a-csv-file" aria-label="q is there a way to store the results of the experiments table in a csv file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Take a look at the <code>--show-json</code> option of <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>. This will print the table in JSON format and you can write a script to save it to another file.</p> <p>We have an <a href="https://github.com/iterative/dvc/issues/5446" target="_blank" rel="nofollow noopener noreferrer">open feature request</a> to add CSV support. Give us some feedback so we know how to prioritize this on our roadmap!</p> <p>There's another workaround you could test out using our Python API, just keep in mind that it isn't public and it's not as user-friendly as it could be. 
Still, you can try something like this:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> itertools
<span class="token keyword">import</span> json

<span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api
<span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd
<span class="token keyword">import</span> yaml

<span class="token comment"># Collect experiment names; Repo is not part of the public API and may change</span>
exps <span class="token operator">=</span> itertools<span class="token punctuation">.</span>chain<span class="token punctuation">.</span>from_iterable<span class="token punctuation">(</span>dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>Repo<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>experiments<span class="token punctuation">.</span>ls<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>values<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>


<span class="token keyword">def</span> <span class="token function">get_exp_info</span><span class="token punctuation">(</span>exp<span class="token punctuation">)</span><span class="token punctuation">:</span>
    exp_dict <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"exp"</span><span class="token punctuation">:</span> exp<span class="token punctuation">}</span>
    <span class="token comment"># Read the params and metrics files as they exist in this experiment</span>
    <span class="token keyword">with</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"params.yaml"</span><span class="token punctuation">,</span> rev<span class="token operator">=</span>exp<span class="token punctuation">)</span> <span class="token keyword">as</span> p<span class="token punctuation">:</span>
        params <span class="token operator">=</span> yaml<span class="token punctuation">.</span>load<span class="token punctuation">(</span>p<span class="token punctuation">,</span> Loader<span class="token operator">=</span>yaml<span class="token punctuation">.</span>Loader<span class="token punctuation">)</span>
        exp_dict<span class="token punctuation">.</span>update<span class="token punctuation">(</span>params<span class="token punctuation">)</span>
    <span class="token keyword">with</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"scores.json"</span><span class="token punctuation">,</span> rev<span class="token operator">=</span>exp<span class="token punctuation">)</span> <span class="token keyword">as</span> s<span class="token punctuation">:</span>
        metrics <span class="token operator">=</span> json<span class="token punctuation">.</span>load<span class="token punctuation">(</span>s<span class="token punctuation">)</span>
        exp_dict<span class="token punctuation">.</span>update<span class="token punctuation">(</span>metrics<span class="token punctuation">)</span>
    <span class="token keyword">return</span> exp_dict


exps_list <span class="token operator">=</span> <span class="token punctuation">[</span>get_exp_info<span class="token punctuation">(</span>exp<span class="token punctuation">)</span> <span class="token keyword">for</span> exp <span class="token keyword">in</span> exps<span class="token punctuation">]</span>
df <span class="token operator">=</span> pd<span class="token punctuation">.</span>DataFrame<span class="token punctuation">.</span>from_records<span class="token punctuation">(</span>exps_list<span class="token punctuation">)</span>
<span class="token comment"># Save the combined table as a CSV file</span>
df<span class="token punctuation">.</span>to_csv<span class="token punctuation">(</span><span class="token string">"experiments.csv"</span><span class="token punctuation">,</span> index<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span></code></pre></div> <p>Great question @Jess_!</p> <h3 id="q-is-there-a-recommended-way-to-specify-multiple-pipelines-in-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/864230750325047316" target="_blank" rel="nofollow noopener 
noreferrer">Q: Is there a recommended way to specify multiple pipelines in DVC?</a><a href="#q-is-there-a-recommended-way-to-specify-multiple-pipelines-in-dvc" aria-label="q is there a recommended way to specify multiple pipelines in dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You'll want to keep each pipeline in a separate <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> if you want to work with multiple pipelines. This is a recommendation and is not required to specify different pipelines. Here's a bit of explanation:</p> <ul> <li>Splitting a <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file into multiple files is encouraged where there are clear logical groupings between stages. It avoids confusion, improves readability, and shortens commands by avoiding long paths preceding every filename.</li> <li><a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files can be in any sub-directory or nested sub-directory in the project structure and DVC will find them.</li> <li>DVC will process them just the same as if they were one DVC file i.e. 
dependencies between stages in different <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files are still respected.</li> <li>Each <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file will have its own <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> file in the same directory.</li> </ul> <p>If you want to see the rest of the explanation, <a href="https://github.com/iterative/dvc.org/issues/2494" target="_blank" rel="nofollow noopener noreferrer">check out this user guide PR we have up</a>. Please feel free to add a comment or emoji on this PR so we know how to prioritize this content for you!</p> <p>Thanks @Tups!</p> <h3 id="q-is-there-way-to-allow-different-pipelines-to-have-common-dependencies-and-outputs-in-dvc-pipelines" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/867747202306146335" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to allow different pipelines to have common dependencies and outputs in DVC pipelines?</a><a href="#q-is-there-way-to-allow-different-pipelines-to-have-common-dependencies-and-outputs-in-dvc-pipelines" aria-label="q is there way to allow different pipelines to have common dependencies and outputs in dvc pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Good question @vgodie!</p> <p>It is possible to have overlapping dependencies, but not overlapping outputs. 
Having overlapping outputs introduces uncertainty into DVC commands, like <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>.</p> <p>Sometimes people want to have overlapping directory outputs (different stages that write many different files to the same directory). Or they might have a series of stages that append to the same file. In these cases, we suggest having each stage write its own new files and combining them in a final stage so they are consistently written in the same order.</p> <h3 id="q-how-does-the-cml-runner-restart-workflows-if-its-been-shut-down-by-aws-eg-spot-instances" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/862641924200857660" target="_blank" rel="nofollow noopener noreferrer">Q: How does the CML runner restart workflows if it's been shut down by AWS (e.g. spot instances)?</a><a href="#q-how-does-the-cml-runner-restart-workflows-if-its-been-shut-down-by-aws-eg-spot-instances" aria-label="q how does the cml runner restart workflows if its been shut down by aws eg spot instances permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You shouldn't have to do anything. Spot instances send a <code>SIGINT</code>, which we handle to restart the workflow. 
We have been supporting graceful shutdown by unregistering runners for a while now.</p> <p>The main difference now is that we restart workflows with unfinished jobs.</p> <p>Thanks for such a good question @andee96!</p> <h3 id="q-can-i-change-an-endpoint-that-is-being-or-does-cml-publish-always-save-the-artifacts-on-this-endpoint" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/864444303169421322" target="_blank" rel="nofollow noopener noreferrer">Q: Can I change an endpoint that is being? Or does <code>cml publish</code> always save the artifacts on this endpoint?</a><a href="#q-can-i-change-an-endpoint-that-is-being-or-does-cml-publish-always-save-the-artifacts-on-this-endpoint" aria-label="q can i change an endpoint that is being or does cml publish always save the artifacts on this endpoint permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Good question @Nwp8nice!</p> <p>If you use GitLab you can use the <code>--native</code> option to upload to GitLab instead.</p> <p>It would be nice to be able to offer an alternative link so if you're interested, a PR for <a href="https://github.com/iterative/cml/issues/291" target="_blank" rel="nofollow noopener noreferrer">this issue</a> would be awesome! 
😊</p> <h3 id="q-is-cml-used-for-creating-the-mlops-workflows-like-apache-airflow" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/866624571519664128" target="_blank" rel="nofollow noopener noreferrer">Q: Is CML used for creating the MLOps workflows, like Apache Airflow?</a><a href="#q-is-cml-used-for-creating-the-mlops-workflows-like-apache-airflow" aria-label="q is cml used for creating the mlops workflows like apache airflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a really good question @Ravi Kumar!</p> <p>CML is intended to augment existing CI/CD engines like GitHub Actions or GitLab CI/CD, not replace them. It's a lightweight wrapper and not a complete replacement workflow ecosystem like Airflow. We don't like reinventing working wheels.</p> <h3 id="q-does-cml-have-the-ability-to-cope-with-long-running-instances-eg-launching-an-aws-instance-via-github-actions-that-lasts-more-than-72-hours" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/866730530262351873" target="_blank" rel="nofollow noopener noreferrer">Q: Does CML have the ability to cope with long-running instances, e.g. 
launching an AWS instance via GitHub Actions that lasts more than 72 hours?</a><a href="#q-does-cml-have-the-ability-to-cope-with-long-running-instances-eg-launching-an-aws-instance-via-github-actions-that-lasts-more-than-72-hours" aria-label="q does cml have the ability to cope with long running instances eg launching an aws instance via github actions that lasts more than 72 hours permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Once the GitHub Actions limit of 72 hours is reached for self-hosted runners, CML will handle restarting the Action and reconnecting to the runner. Meanwhile, on GitLab there is no time limit to circumvent for self-hosted runners.</p> <p>Thanks @sergechuvakin!</p> <hr> <p><img src="https://media.giphy.com/media/l0IycQmt79g9XzOWQ/giphy.gif" alt="Shut It Down GIF by Matt Cutshall"></p> <p>At our September Office Hours Meetup we will be doing a live demo of running experiments to fine-tune an existing model to work on a different dataset. 
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279024694/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/transfer-learning-experimentshttps://dvc.org/blog/transfer-learning-experimentsTue, 24 Aug 2021 00:00:00 GMT<h2 id="intro" style="position:relative;">Intro<a href="#intro" aria-label="intro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are plenty of machine learning models available that have been trained to solve one problem and the knowledge gained from that can be applied to a new, yet related problem. For example, a model like AlexNet has been trained on millions of images so you could potentially use this to classify cars, animals, or even people. 
This is called <a href="https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a" target="_blank" rel="nofollow noopener noreferrer">transfer learning</a> and it can save a lot of time compared with developing a model from scratch.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/S3Hm_BPLie0?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>To take advantage of transfer learning, we can use fine-tuning to adapt the model to our new problem. In many cases, we start by replacing the last layer of the model. With the AlexNet example, this might mean the last layer was previously used to classify cars, but our new problem is classifying animals.</p> <p>Even though we already have the bulk of the model defined, we'll still have to do some experimentation to determine whether we need to replace more layers in the model or make any other changes.</p> <p>In this post, we'll go through an example of fine-tuning <a href="https://towardsdatascience.com/alexnet-the-architecture-that-challenged-cnns-e406d5297951" target="_blank" rel="nofollow noopener noreferrer">AlexNet</a> and <a href="https://towardsdatascience.com/review-squeezenet-image-classification-e7414825581a" target="_blank" rel="nofollow noopener noreferrer">SqueezeNet</a> to classify bees and ants. 
We'll use DVC to handle experiments for us and we'll compare the results of both models at the end.</p> <h2 id="initialize-the-pre-trained-model" style="position:relative;">Initialize the pre-trained model<a href="#initialize-the-pre-trained-model" aria-label="initialize the pre trained model permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We'll be fine-tuning the AlexNet model and the SqueezeNet model to classify images of bees and ants. You can find the project we're working with in <a href="https://github.com/iterative/pretrained-model-demo" target="_blank" rel="nofollow noopener noreferrer">this repo</a>, which is based on the tutorial over at <a href="https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html" target="_blank" rel="nofollow noopener noreferrer">this post</a>.</p> <p>In the <code>pretrained_model_tuner.py</code> file, you'll find the code that defines both the AlexNet and SqueezeNet models. We start by initializing these models so we can get the number of model features and the input size we need for fine-tuning.</p> <p>Since the project has everything we need to initialize the models, we can start training and comparing the differences between them with the ants/bees dataset. Running experiments to get the best tuning for each model can make it difficult to see which changes led to a better result. 
That's why we will be using DVC to track changes in the code and the data.</p> <h2 id="adding-the-train-stage" style="position:relative;">Adding the train stage<a href="#adding-the-train-stage" aria-label="adding the train stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Stages in DVC let us define individual data processes and can be used to build detailed machine learning pipelines. You have the ability to define the different steps of model creation like preprocessing, featurization, and training.</p> <p>We currently have a <code>train</code> stage in the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. 
If you take a look at it, you'll see something like:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python pretrained_model_tuner.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> data/hymenoptera_data
      <span class="token punctuation">-</span> pretrained_model_tuner.py
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> lr
      <span class="token punctuation">-</span> momentum
      <span class="token punctuation">-</span> model_name
      <span class="token punctuation">-</span> num_classes
      <span class="token punctuation">-</span> batch_size
      <span class="token punctuation">-</span> num_epochs
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> <span class="token key atrule">model.pt</span><span class="token punctuation">:</span>
          <span class="token key atrule">checkpoint</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
    <span class="token key atrule">live</span><span class="token punctuation">:</span>
      <span class="token key atrule">results</span><span class="token punctuation">:</span>
        <span class="token key atrule">summary</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
        <span class="token key atrule">html</span><span class="token punctuation">:</span> <span class="token boolean important">true</span></code></pre></div> <p>The reason we need this <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file is so DVC 
knows what to pay attention to in our workflow. It will start managing data, and it will know which metrics to track and what output to expect from each step.</p> <p>You'll typically add stages to <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> using the <a href="https://dvc.org/doc/command-reference/stage/add"><code>dvc stage add</code></a> command, which is one of the ways you can add new stages or update existing ones.</p> <p>With the <code>train</code> stage defined, let's look at where the metrics actually come from in the code. If you open <code>pretrained_model_tuner.py</code>, you'll see where we dump the accuracy and loss for the training epochs into the <code>results.json</code> file. We're also saving the model after each training epoch and recording metrics for each epoch using <code>dvclive</code> logging.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">if</span> phase <span class="token operator">==</span> <span class="token string">'train'</span><span class="token punctuation">:</span> torch<span class="token punctuation">.</span>save<span class="token punctuation">(</span>model<span class="token punctuation">.</span>state_dict<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"model.pt"</span><span class="token punctuation">)</span> dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'acc'</span><span class="token punctuation">,</span> epoch_acc<span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token 
string">'loss'</span><span class="token punctuation">,</span> epoch_loss<span class="token punctuation">)</span> dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'training_time'</span><span class="token punctuation">,</span> epoch_time_elapsed<span class="token punctuation">)</span> <span class="token keyword">if</span> phase <span class="token operator">==</span> <span class="token string">'val'</span><span class="token punctuation">:</span> dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'val_acc'</span><span class="token punctuation">,</span> epoch_acc<span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> dvclive<span class="token punctuation">.</span>log<span class="token punctuation">(</span><span class="token string">'val_loss'</span><span class="token punctuation">,</span> epoch_loss<span class="token punctuation">)</span> val_acc_history<span class="token punctuation">.</span>append<span class="token punctuation">(</span>epoch_acc<span class="token punctuation">)</span> dvclive<span class="token punctuation">.</span>next_step<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> <p>This code is needed to let DVC access the metrics in the project because it will read the metrics from the <code>dvclive.json</code> file.</p> <p>Since we have several hyperparameters set in the <code>params.yaml</code>, we need to use those values when we run the training stage. 
The following code makes the hyperparameter values accessible in the <code>train</code> function.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> ruamel<span class="token punctuation">.</span>yaml <span class="token keyword">import</span> YAML

<span class="token comment"># Load the hyperparameters that DVC tracks in params.yaml</span>
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"params.yaml"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    yaml<span class="token operator">=</span>YAML<span class="token punctuation">(</span>typ<span class="token operator">=</span><span class="token string">'safe'</span><span class="token punctuation">)</span>
    params <span class="token operator">=</span> yaml<span class="token punctuation">.</span>load<span class="token punctuation">(</span>f<span class="token punctuation">)</span></code></pre></div> <p>With all of this in place, we can finally start running experiments to fine-tune the two models.</p> <h2 id="fine-tuning-alexnet" style="position:relative;">Fine-tuning AlexNet<a href="#fine-tuning-alexnet" aria-label="fine tuning alexnet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>You can find the code that initializes the AlexNet model in the <code>initialize_model</code> function in <code>pretrained_model_tuner.py</code>. 
Since we have DVC set up, we can jump straight into fine-tuning this model to see which hyperparameters give us the best accuracy.</p> <p>We'll run the first experiment with the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div> <p>This will execute the <code>pretrained_model_tuner.py</code> script and run for 5 epochs since that's what we defined in <code>params.yaml</code>. When this finishes, you can check out the metrics from this run with the current hyperparameter values.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span></span></code></pre></div> <p>You'll see a table similar to this.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Created<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>num_classes<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>batch_size<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>num_epochs<span class="token hide">**</span></span></span> </span>
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
<span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> - <span class="token bold"><span class="token hide">**</span>4<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.92623<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.19567<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>229.18<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.9085<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.25145<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>2<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>8<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>5<span class="token hide">**</span></span>
 <span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>01:58 PM<span class="token hide">**</span></span> - - - - - - <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>2<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>8<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>5<span class="token hide">**</span></span>
 │ ╓ bf81637 [exp-a1f53] 02:05 PM 4 0.92623 0.19567 229.18 0.9085 0.25145 0.001 0.09 alexnet 2 8 5
 │ ╟ 9ca3fb8 02:04 PM 3 0.89344 0.27423 178.34 0.90196 0.26965 0.001 0.09 alexnet 2 8 5
 │ ╟ a34ead1 02:03 PM 2 0.87295 0.29018 127.36 0.9085 0.2796 0.001 0.09 alexnet 2 8 5
 │ ╟ ae382c7 02:02 PM 1 0.89754 0.26993 76.419 0.89542 0.31113 0.001 0.09 alexnet 2 8 5
 ├─╨ a95260d 02:01 PM 0 0.73361 0.5271 25.71 0.86928 0.36408 0.001 0.09 alexnet 2 8 5 </span>
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>Now let's update the hyperparameters and run another experiment. There are several ways to do this with DVC:</p> <ul> <li>Change the hyperparameter values directly in <code>params.yaml</code></li> <li>Update the values using the <code>--set-param</code> or the shorthand <code>-S</code> option on <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a></li> <li>Queue multiple experiments with different values using the <code>--queue</code> option on <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a></li> </ul> <p>We'll do an example of each of these throughout the rest of this article.</p> <p>Let's start by updating the hyperparameter values in <code>params.yaml</code>. 
After the update, your file should contain these values.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">lr</span><span class="token punctuation">:</span> <span class="token number">0.009</span>
<span class="token key atrule">momentum</span><span class="token punctuation">:</span> <span class="token number">0.017</span></code></pre></div> <p>Now run another experiment with <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>. To make the table more readable, we're going to specify the parameters we want to show and take a look at the metrics with:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token parameter variable">--include-params</span> lr,momentum,model_name</span></code></pre></div> <p>Your table should look something like this now. Since we're using checkpoints, note that training continues for additional epochs on top of the previous experiment. 
You'll see what it takes to start training from scratch later.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token 
bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span> </span> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>9<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.91803<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.27989<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>228.59<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.82353<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.69077<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.009<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.017<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token 
hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span> │ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet │ ╟ 7686d2f 8 0.90984 0.23496 177.65 0.87582 0.50887 0.009 0.017 alexnet │ ╟ 671f8cd 7 0.88934 0.39237 126.7 0.86928 0.47856 0.009 0.017 alexnet │ ╟ ea1bf61 6 0.84836 0.4195 75.834 0.91503 0.30885 0.009 0.017 alexnet │ ╟ a9f8dab (bf81637) 5 0.79508 0.72891 25.219 0.66667 1.0311 0.009 0.017 alexnet │ ╓ bf81637 [exp-a1f53] 4 0.92623 0.19567 229.18 0.9085 0.25145 0.001 0.09 alexnet │ ╟ 9ca3fb8 3 0.89344 0.27423 178.34 0.90196 0.26965 0.001 0.09 alexnet │ ╟ a34ead1 2 0.87295 0.29018 127.36 0.9085 0.2796 0.001 0.09 alexnet │ ╟ ae382c7 1 0.89754 0.26993 76.419 0.89542 0.31113 0.001 0.09 alexnet ├─╨ a95260d 0 0.73361 0.5271 25.71 0.86928 0.36408 0.001 0.09 alexnet </span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>Finding good values for hyperparameters can take a few iterations, even when you're working with a pretrained model. So we'll run one more experiment to fine-tune this AlexNet model. 
This time we'll do it using the <code>-S</code> option.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">lr</span><span class="token operator">=</span><span class="token number">0.025</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">momentum</span><span class="token operator">=</span><span class="token number">0.5</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">num_epochs</span><span class="token operator">=</span><span class="token number">2</span></span></code></pre></div> <p>The updated table will have values similar to this.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token 
hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span> </span> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>11<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.88525<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>1.1355<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>76.799<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.9085<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>1.7642<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.025<span class="token hide">**</span></span> 
<span class="token bold"><span class="token hide">**</span>0.5<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>-<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>alexnet<span class="token hide">**</span></span> │ ╓ 54e87bc [exp-52406] 11 0.88525 1.1355 76.799 0.9085 1.7642 0.025 0.5 alexnet │ ╟ b2b9ad0 (2361cff) 10 0.79098 2.9427 25.715 0.8366 1.4148 0.025 0.5 alexnet │ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet │ ╟ 7686d2f 8 0.90984 0.23496 177.65 0.87582 0.50887 0.009 0.017 alexnet │ ╟ 671f8cd 7 0.88934 0.39237 126.7 0.86928 0.47856 0.009 0.017 alexnet │ ╟ ea1bf61 6 0.84836 0.4195 75.834 0.91503 0.30885 0.009 0.017 alexnet │ ╟ a9f8dab (bf81637) 5 0.79508 0.72891 25.219 0.66667 1.0311 0.009 0.017 alexnet │ ╓ bf81637 [exp-a1f53] 4 0.92623 0.19567 229.18 0.9085 0.25145 0.001 0.09 alexnet</span></code></pre></div> <p>If you take a look at the metrics and the corresponding hyperparameter values, you'll see which 
direction you should try next with your values. That's one way we can use DVC to fine-tune AlexNet for this particular dataset.</p> <h2 id="fine-tuning-squeezenet" style="position:relative;">Fine-tuning SqueezeNet<a href="#fine-tuning-squeezenet" aria-label="fine tuning squeezenet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We'll switch over to fine-tuning SqueezeNet now that you've seen how the process works in DVC. If you're following along, update the <code>model_name</code> hyperparameter in <code>params.yaml</code> to <code>squeezenet</code>. The other hyperparameter values can stay the same for now.</p> <p>This is a good time to note that DVC tracks not only the hyperparameter changes for each experiment but also any code and dataset changes.</p> <p>Let's run one experiment with <a href="https://dvc.org/doc/command-reference/exp/run#--reset"><code>dvc exp run --reset</code></a> just to show the difference in the metrics between the two models. Remember, since we're using checkpoints, training continues on top of the previous experiment. That's why we're using the <code>--reset</code> option here, so that we can start a fresh experiment for the new model. 
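</p> <p>As a minimal sketch of that step, assuming a DVC release that still supports checkpoint experiments (the <code>--reset</code> option belongs to that feature) and reusing the <code>dvc exp show</code> options from earlier in this article, the two commands would look like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--reset</span></span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token parameter variable">--include-params</span> lr,momentum,model_name</span></code></pre></div> <p>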
You should see results similar to this in your table.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token 
bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span> </span> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>1<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.85656<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.35667<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>83.414<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.87582<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.34273<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.025<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.5<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> - - - - - - <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span> │ ╓ 87ccd2e [exp-95f0f] 1 0.85656 0.35667 83.414 0.87582 0.34273 0.025 0.5 squeezenet ├─╨ 7d2fafc 0 0.80328 0.50723 29.165 0.89542 0.3987 0.025 0.5 squeezenet │ ╓ 
54e87bc [exp-52406] 11 0.88525 1.1355 76.799 0.9085 1.7642 0.025 0.5 alexnet │ ╟ b2b9ad0 (2361cff) 10 0.79098 2.9427 25.715 0.8366 1.4148 0.025 0.5 alexnet │ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet</span></code></pre></div> <p>The newest experiment has a significantly different accuracy since we switched models. That tells us that the hyperparameter values that were good for AlexNet might not work best for SqueezeNet.</p> <p>So we'll need to run a few experiments to find the best hyperparameter values. This time, we'll take advantage of queues in DVC to set up the experiments first and then run them all with a single command. To queue an experiment, we'll run this command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--queue</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">lr</span><span class="token operator">=</span><span class="token number">0.0001</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">momentum</span><span class="token operator">=</span><span class="token number">0.9</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">num_epochs</span><span class="token operator">=</span><span class="token number">2</span></span></code></pre></div> <p>Running this sets up an experiment for future execution, so we'll go ahead and run this command one more time with different values.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--queue</span> <span class="token parameter variable">-S</span> <span 
class="token assign-left variable">lr</span><span class="token operator">=</span><span class="token number">0.001</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">momentum</span><span class="token operator">=</span><span class="token number">0.09</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">num_epochs</span><span class="token operator">=</span><span class="token number">2</span></span></code></pre></div> <p>You can check out the details for the queues you have in place by looking at the experiments table with <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>. You'll see something like this.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span 
class="token bold"><span class="token hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span> </span> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>1<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.85656<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.35667<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>83.414<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.87582<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.34273<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.025<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.5<span class="token hide">**</span></span> <span class="token bold"><span class="token 
hide">**</span>squeezenet<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> - - - - - - <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span> │ ╓ 87ccd2e [exp-95f0f] 1 0.85656 0.35667 83.414 0.87582 0.34273 0.025 0.5 squeezenet ├─╨ 7d2fafc 0 0.80328 0.50723 29.165 0.89542 0.3987 0.025 0.5 squeezenet │ ╓ 54e87bc [exp-52406] 11 0.88525 1.1355 76.799 0.9085 1.7642 0.025 0.5 alexnet │ ╟ b2b9ad0 (2361cff) 10 0.79098 2.9427 25.715 0.8366 1.4148 0.025 0.5 alexnet │ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet │ ╟ 7686d2f 8 0.90984 0.23496 177.65 0.87582 0.50887 0.009 0.017 alexnet │ ╟ 671f8cd 7 0.88934 0.39237 126.7 0.86928 0.47856 0.009 0.017 alexnet │ ╟ ea1bf61 6 0.84836 0.4195 75.834 0.91503 0.30885 0.009 0.017 alexnet ... 
├── *2df7fa5 - - - - - - 0.0001 0.9 squeezenet ├── *699dcae - - - - - - 0.001 0.09 squeezenet </span> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>Then you can execute all of the queued experiments with this command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--run-all</span></span></code></pre></div> <p>Now if you take a look at your table, you'll see the metrics from all 3 SqueezeNet experiments.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>training_time<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token 
hide">**</span>val_acc<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>lr<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>momentum<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>model_name<span class="token hide">**</span></span></span> </span> ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>5<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.76639<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.49865<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>85.705<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.81699<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.4518<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span 
class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>main<span class="token hide">**</span></span> - - - - - - <span class="token bold"><span class="token hide">**</span>0.001<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.09<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>squeezenet<span class="token hide">**</span></span> │ ╓ 699dcae [exp-8322f] 5 0.76639 0.49865 85.705 0.81699 0.4518 0.001 0.09 squeezenet │ ╟ d26c25b (2df7fa5) 4 0.60246 0.68464 29.243 0.69935 0.55156 0.001 0.09 squeezenet │ ╓ 2df7fa5 [exp-d1c65] 3 0.78689 0.488 83.929 0.83007 0.41527 0.0001 0.9 squeezenet │ ╟ 05e1b41 (87ccd2e) 2 0.59016 0.76999 28.455 0.75163 0.49807 0.0001 0.9 squeezenet │ ╓ 87ccd2e [exp-95f0f] 1 0.85656 0.35667 83.414 0.87582 0.34273 0.025 0.5 squeezenet ├─╨ 7d2fafc 0 0.80328 0.50723 29.165 0.89542 0.3987 0.025 0.5 squeezenet │ ╓ 54e87bc [exp-52406] 11 0.88525 1.1355 76.799 0.9085 1.7642 0.025 0.5 alexnet │ ╟ b2b9ad0 (2361cff) 10 0.79098 2.9427 25.715 0.8366 1.4148 0.025 0.5 alexnet │ ╓ 2361cff [exp-c0b11] 9 0.91803 0.27989 228.59 0.82353 0.69077 0.009 0.017 alexnet │ ╟ 7686d2f 8 0.90984 0.23496 177.65 0.87582 0.50887 0.009 0.017 alexnet</span></code></pre></div> <p>Then you can decide which direction to take with your fine-tuning efforts and which model works best for your project. 
In this case, it seems like SqueezeNet might be the winner!</p> <p>You can take all of the DVC setup and apply this to your own custom fine-tuning use case.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When you're working with pretrained models, it can be hard to fine-tune them to give you the results you need. You might end up replacing the last layer of the model to fit your problem or you might need to dig deeper. Then you have to consider updating the hyperparameter values until you get the best model you can.</p> <p>That's why it's important to research tools that make this process more efficient. Using DVC to help with this kind of experimentation will give you the ability to reproduce any experiment you run, making it easier to collaborate with others on a project. 
It will also help you keep track of what you've already tried in previous experiments.</p>https://dvc.org/blog/august-21-dvc-heartbeathttps://dvc.org/blog/august-21-dvc-heartbeatTue, 17 Aug 2021 00:00:00 GMT<h1 id="its-all-about-that-data" style="position:relative;">It's all about that Data!<a href="#its-all-about-that-data" aria-label="its all about that data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/4FQMuOKR6zQRO/giphy.gif" alt="Data! Data! Data!"></p> <h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>This month we are seeing the progression of a couple of pieces from the <a href="https://media.giphy.com/media/62HBhssMOgdJUZQp1X/giphy.gif" target="_blank" rel="nofollow noopener noreferrer">June Heartbeat</a> as well as checking out a use case, tool stack, and some great tutorials of our Community members.</p> <h2 id="lj-miranda-synthesizes-the-mlops-space-once-again" style="position:relative;">LJ Miranda synthesizes the MLOps space once again!<a 
href="#lj-miranda-synthesizes-the-mlops-space-once-again" aria-label="lj miranda synthesizes the mlops space once again permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/ljvmiranda921" target="_blank" rel="nofollow noopener noreferrer">LJ Miranda</a> writes another amazing article after the series of articles he wrote covering the MLOps tools landscape we covered in the June Heartbeat. This time he focuses on the wave of data-centric focus taking over the space giving a review of the methods, approaches, and techniques to ensure quality data for ML projects. 
If the adroit summaries of complex concepts don't thrill you, the links to no fewer than 63 (😱) resources will get you on your way to data-centric nirvana.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 662px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6cfa523455e454fb01e9f7fabb1cf96f/39600/lj-miranda-data-centric.png" alt="Data Centric Framework" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>LJ Miranda's Framework for putting data-centric machine learning into context <a href="https://ljvmiranda921.github.io/notebook/2021/07/30/data-centric-ml/" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h2 id="neda-sultovas-comparison-of-dvc-mlflow-and-metaflow" style="position:relative;">Neda Sultova's Comparison of DVC, MLFlow and Metaflow<a href="#neda-sultovas-comparison-of-dvc-mlflow-and-metaflow" aria-label="neda sultovas comparison of dvc mlflow and metaflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Also covered in the June Heartbeat was <a href="https://www.linkedin.com/in/neda-sultova-597a811a8/" target="_blank" rel="nofollow noopener noreferrer">Neda Sultova's</a> piece on the rubric she is using to decide on what MLOps tools to use for the teams at <a href="https://www.helmholtz.ai/" target="_blank" rel="nofollow noopener noreferrer">Helmholtz AI</a>. 
This <a href="https://medium.com/geekculture/comparing-metaflow-mlflow-and-dvc-e84be6db2e2" target="_blank" rel="nofollow noopener noreferrer">next article</a> reviews her research into DVC, MLFlow and Metaflow and offers a thorough analysis of the tools across multiple dimensions. Beyond the article, check out her <a href="https://github.com/hzdr/mlops_comparison" target="_blank" rel="nofollow noopener noreferrer">MLOps Comparison repository</a> as well as her <a href="https://github.com/hzdr/mlops_comparison/blob/master/Content/Comparison_table.pdf" target="_blank" rel="nofollow noopener noreferrer">Comparison Table</a>. They will not disappoint!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 454px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/24f89675d24a0316c700db40eee9b0f2/39600/neda-sultova-2.png" alt="Machine Learning Lifecycle" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Machine Learning Lifecycle <a href="https://medium.com/geekculture/comparing-metaflow-mlflow-and-dvc-e84be6db2e2" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h2 id="amit-kulkarnis-tutorials" style="position:relative;">Amit Kulkarni's Tutorials<a href="#amit-kulkarnis-tutorials" aria-label="amit kulkarnis tutorials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Writing for the <a href="https://datahack.analyticsvidhya.com/contest/data-science-blogathon-9/#LeaderBoard" target="_blank" 
rel="nofollow noopener noreferrer">Analytics Vidhya Data Science Blogathon,</a> <a href="https://www.linkedin.com/in/amitvkulkarni2/" target="_blank" rel="nofollow noopener noreferrer">Amit Kulkarni</a> created two tutorials on DVC. <a href="https://www.analyticsvidhya.com/blog/2021/06/mlops-tracking-ml-experiments-with-data-version-control/?utm_source=dlvr.it&utm_medium=twitter" target="_blank" rel="nofollow noopener noreferrer">Tracking ML Experiments with Data Version Control</a> reviews DVC and takes you through getting started, setup, fetching data and pre-processing, and the steps of an ML project. Next, it sets up DVC and the pipeline, and shows how to produce model metrics and plots. In <a href="https://www.analyticsvidhya.com/blog/2021/06/mlops-versioning-datasets-with-git-dvc/" target="_blank" rel="nofollow noopener noreferrer">MLOps | Versioning Datasets with Git & DVC,</a> Amit continues with an explanation of how data and model versioning works with GitHub paired with DVC.</p> <p>In a previous article entitled <a href="https://www.analyticsvidhya.com/blog/2021/04/bring-devops-to-data-science-with-continuous-mlops/" target="_blank" rel="nofollow noopener noreferrer">Bring DevOps to Data Science with MLOps</a>, Amit walks through a tutorial using CML to bring CI/CD functionality to your ML project and automate the process. 
All great posts to check out!👇🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://www.analyticsvidhya.com/blog/2021/06/mlops-tracking-ml-experiments-with-data-version-control/?utm_source=dlvr.it&utm_medium=twitter" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Tracking ML Experiments With Data Version Control</h4> <div class="elp-description">Amit Kulkarni's tutorial on getting started with DVC and tracking experiments</div> <div class="elp-link">https://analyticsvidhya.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-08-17/a-v-8059e54b05396a5537a69588b79d36c7.png" alt="Tracking ML Experiments With Data Version Control"> </div> </a> </section> <section class="elp-content-holder"> <a href="https://www.analyticsvidhya.com/blog/2021/06/mlops-versioning-datasets-with-git-dvc/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">MLOps | Versioning Datasets with Git & DVC</h4> <div class="elp-description">Amit Kulkarni's tutorial on how DVC works with Git to version your datasets.</div> <div class="elp-link">https://analyticsvidhya.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-08-17/a-v-8059e54b05396a5537a69588b79d36c7.png" alt="MLOps | Versioning Datasets with Git & DVC"> </div> </a> </section> <section class="elp-content-holder"> <a href="https://www.analyticsvidhya.com/blog/2021/04/bring-devops-to-data-science-with-continuous-mlops/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Bring DevOps To Data Science With MLOps</h4> <div class="elp-description">Amit Kulkarni's tutorial on how to use CML to bring the CI/CD functionality of DevOps to your data science projects.</div> <div class="elp-link">https://analyticsvidhya.com</div> </div> <div 
class="elp-image-holder"> <img src="https://dvc.org/2021-08-17/a-v-8059e54b05396a5537a69588b79d36c7.png" alt="Bring DevOps To Data Science With MLOps"> </div> </a> </section> <p></p> <h2 id="andreas-malekos-mlops-tool-stack-at-continuum-industries" style="position:relative;">Andreas Malekos' MLOps Tool Stack at Continuum Industries<a href="#andreas-malekos-mlops-tool-stack-at-continuum-industries" aria-label="andreas malekos mlops tool stack at continuum industries permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Last but not least, we bring you a great article from <a href="https://www.linkedin.com/in/andreasmalekos/" target="_blank" rel="nofollow noopener noreferrer">Andreas Malekos</a>, Chief Scientist at <a href="https://www.continuum.industries/" target="_blank" rel="nofollow noopener noreferrer">Continuum Industries</a>. In <a href="https://neptune.ai/blog/mlops-tool-stack-continuum-industries" target="_blank" rel="nofollow noopener noreferrer">the post</a> he outlines the tool stack and MLOps platform they use to do their work automating and optimizing the design of linear infrastructure assets like water pipelines, overhead transmission lines, subsea power lines, or telecommunication cables.</p> <p>Amongst their tool stack are DVC and CML, and the article outlines what they like (!🙈Spoiler alert🙊! 
DVC making repeatability achievable) and the things that they don't like that still need to be improved.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 670.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ec7c9901d7fcae60af4221b7fc2796d2/39600/continuum-tool-stack.png" alt="Continuum Industries MLOps Tool Stack" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Continuum Industries MLOps Tool Stack <a href="https://neptune.ai/wp-content../uploads/Continuum-Industries-tool-stack-final.png" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Though the team has been taking some vacation time in the last month, there's still a lot going on!</p> <p><img src="https://media.giphy.com/media/aNqEFrYVnsS52/giphy.gif" alt="Typing Cat"></p> <h2 id="docs-updates" style="position:relative;">Docs Updates<a href="#docs-updates" aria-label="docs updates permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 
13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This month we are introducing docs updates so that you will always be aware of what has changed as our open source projects mature.</p> <p>Our docs team made up of <a href="https://www.linkedin.com/in/jorgeorpinel/" target="_blank" rel="nofollow noopener noreferrer">Jorge Orpinel</a>, <a href="https://emresahin.net" target="_blank" rel="nofollow noopener noreferrer">Emre Şahin</a>, <a href="https://cdcl.ml" target="_blank" rel="nofollow noopener noreferrer">Casper da Costa-Luis</a>, and <a href="https://www.linkedin.com/in/david-de-la-iglesia-castro-b4b67b20a/" target="_blank" rel="nofollow noopener noreferrer">David de la Iglesia-Castro,</a> has been hard at work updating our docs to make sure you have what you need to be successful using our tools! Updates include:</p> <ul> <li>Complete <a href="https://dvc.org/doc/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVCLive docs</a></li> <li>We have a new <a href="https://dvc.org/doc/user-guide/glossary" target="_blank" rel="nofollow noopener noreferrer">Glossary page</a> and a first Basic Concepts page (<a href="https://dvc.org/doc/user-guide/basic-concepts/workspace" target="_blank" rel="nofollow noopener noreferrer"><em>DVC Workspace</em></a>)</li> <li><a href="https://cml.dev/doc" target="_blank" rel="nofollow noopener noreferrer">CML Docs migration to CML.Dev</a></li> <li><a href="https://dvc.org/doc/start" target="_blank" rel="nofollow noopener noreferrer">Added Videos to Get Started: Metrics and Experiments pages</a> and <a href="https://dvc.org/doc/user-guide/experiment-management/checkpoints" target="_blank" rel="nofollow noopener noreferrer">Checkpoints Guide</a></li> <li>Authentication examples for <a href="https://dvc.org/doc/command-reference/remote/modify#example-some-azure-authentication-methods" target="_blank" rel="nofollow noopener noreferrer">Azure Blob remote storage</a> from Community member @meierale ❤️</li> </ul> <h2 
id="batuhan-taskayas-refactor-project-hits-first-page-in-hackernews" style="position:relative;">Batuhan Taskaya's Refactor Project hits First Page in HackerNews!<a href="#batuhan-taskayas-refactor-project-hits-first-page-in-hackernews" aria-label="batuhan taskayas refactor project hits first page in hackernews permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A <a href="https://github.com/isidentical/refactor" target="_blank" rel="nofollow noopener noreferrer">Refactor Project</a> created by team Member <a href="https://twitter.com/isidentical" target="_blank" rel="nofollow noopener noreferrer">Batuhan Taskaya</a> (AKA @isidentical), was shared by someone on HackerNews and made it to the main page! You can <a href="https://news.ycombinator.com/item?id=28027016" target="_blank" rel="nofollow noopener noreferrer">catch all the comments here</a>!</p> <p>Explanation of the project:</p> <blockquote> <p>refactor is an end-to-end refactoring framework that is built on top of the 'simple but effective refactorings' assumption. It is much easier to write a simple script with it rather than trying to figure out what sort of a regex you need in order to replace a pattern (if it is even matchable with regexes).</p> </blockquote> <blockquote> <p>Every refactoring rule offers a single entrypoint, match(), where they accept an AST node (from the ast module in the standard library) and respond with either returning an action to refactor or nothing. 
If the rule succeeds on the input, then the returned action will build a replacement node and refactor will simply replace the code segment that belong to the input with the new version.</p> </blockquote> <p>Way to go Batuhan! 🚀</p> <h2 id="july-office-hour-meetup" style="position:relative;">July Office Hour Meetup<a href="#july-office-hour-meetup" aria-label="july office hour meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you missed our July Office Hours, good news! It's now available on our <a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube Channel</a>, where you can see <a href="https://twitter.com/jcpsantiago" target="_blank" rel="nofollow noopener noreferrer">João Santiago</a> talk about {dvthis} and how his team at <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie.io</a> uses DVC to productionize rstats.</p> <p>Also in the Meetup is a DVC Studio demo by <a href="https://www.linkedin.com/in/tapa-dipti-sitaula/" target="_blank" rel="nofollow noopener noreferrer">Tapa Dipti Sitaula</a>, Senior Product Engineer for Studio. 
You can catch the presentations along with great questions and discussion from the Community!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/H22j1lWIvMw?rel=0&%3B=&%3Bshowinfo=0%3B&start=1546" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>So remember when I told you last month about DVC + Streamlit = ❤️ ? Well at our August Office Hours Meetup, <a href="https://www.linkedin.com/in/antoine-toubhans-92262119/" target="_blank" rel="nofollow noopener noreferrer">Antoine Toubhans</a> of <a href="https://www.sicara.fr/" target="_blank" rel="nofollow noopener noreferrer">Sicara</a> will be presenting <a href="https://www.sicara.ai/blog/dvc-streamlit-webui-ml" target="_blank" rel="nofollow noopener noreferrer">his tutorial</a> on how to do just that! Join us in the integrating fun on August 19th at 3:00 pm UTC! RSVP at this link below! 
👇🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279723437/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVC Office Hours - DVC and Streamlit Integration</h4> <div class="elp-description">Antoine Toubhans of Sicara shares his tutorial for using Streamlit with DVC to create a customizable web UI</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-08-17/streamlit-oh-0f211180ca12528deb0318d283d7886d.png" alt="DVC Office Hours - DVC and Streamlit Integration"> </div> </a> </section> <p></p> <h2 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This week's DVC Learn Meetup (August 18th) will be the last in our series of DVC Learn Meetups designed to get teams up and running with DVC. We will digest our learnings from this first cohort and revamp for the next set of three classes that will begin in September. 
Subscribe to <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/" target="_blank" rel="nofollow noopener noreferrer">our Meetup group</a> and follow us on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and <a href="https://www.linkedin.com/company/18657719" target="_blank" rel="nofollow noopener noreferrer">LinkedIn</a> to stay in the know about all of our upcoming events!</p> <p>If you are interested in weighing in on what kinds of educational content you would like to see from us, we'd be grateful if you'd fill out <a href="https://docs.google.com/forms/d/e/1FAIpQLSdmwjs0ZkxDdODfZTvSwP2bVW4JAVVdxiYhQPyW5dSbsZC8qg/viewform?pli=1" target="_blank" rel="nofollow noopener noreferrer"><strong>this survey</strong></a> to help us plan! 🙏🏼</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 676.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/83fdf48f7311c67c558afe07fa5a639b/39600/survey.png" alt="DVC Online Course survey" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Help us plan our Online Course! 
🙏🏼 <a href="https://docs.google.com/forms/d/e/1FAIpQLSdmwjs0ZkxDdODfZTvSwP2bVW4JAVVdxiYhQPyW5dSbsZC8qg/viewform?pli=1" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Looking for a great opportunity at an amazing company? Check out our open positions <a href="https://www.notion.so/iterative/iterative-ai-is-hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">at this link</a> to find details of all the positions including:</p> <ul> <li>Senior Front-End Engineer (TypeScript, Node, React)</li> <li>Senior Software Engineer (ML, Dev Tools, Python)</li> <li>Senior Software Engineer (ML, Data Infra, GoLang)</li> <li>Machine Learning Engineer/Field Data Scientist</li> <li>Developer Advocate (ML)</li> <li>Director/VP of Engineering (ML, DevTools)</li> <li>Director/VP of Product (ML, Data Infra, SaaS)</li> <li>Director/VP of Operations/Chief of Staff</li> </ul> <p>Please pass this info on to anyone you know that may fit the bill. We look forward to new team members! 🎉</p> <hr> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/july-21-dvc-community-gemshttps://dvc.org/blog/july-21-dvc-community-gemsTue, 27 Jul 2021 00:00:00 GMT<h3 id="q-im-trying-to-use-the---reuse-option-of-cml-runner-if-i-launch-2-cml-experiments-in-parallel-will-cml-use-the-same-runner-or-spin-up-another-one-if-the-existing-one-is-in-use" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/850340190434492445" target="_blank" rel="nofollow noopener noreferrer">Q: I'm trying to use the <code>--reuse</code> option of <code>cml runner</code>. If I launch 2 CML experiments in parallel, will CML use the same runner or spin up another one if the existing one is in use?</a><a href="#q-im-trying-to-use-the---reuse-option-of-cml-runner-if-i-launch-2-cml-experiments-in-parallel-will-cml-use-the-same-runner-or-spin-up-another-one-if-the-existing-one-is-in-use" aria-label="q im trying to use the reuse option of cml runner if i launch 2 cml experiments in parallel will cml use the same runner or spin up another one if the existing one is in use permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you don't reuse the runner and you have set up a deploy job, that deploy job will launch two cloud runners. 
With <code>--reuse</code>, it will check whether a runner with that tag already exists and will not launch another one. Every runner will keep listening for incoming jobs until the maximum idle time is reached.</p> <p>Let's say that you set up one runner with <code>--reuse</code> and launch multiple jobs. In that case, only one runner should be launched, and it will take all the jobs.</p> <p>The runner that the deploy job launches is not tied specifically to the train job that is going to be launched in the same workflow. You just add runners to the pool and they will be waiting until the idle time is done.</p> <p>We're working on something like <code>--reuse-idle</code> that would be easy to implement. The idea would be to reuse only idle runners, so that if your job fails and the fix is pretty fast, you don't need to spin up another runner. You can track our progress on that through <a href="https://github.com/iterative/cml/issues/575" target="_blank" rel="nofollow noopener noreferrer">this GitHub issue</a>.</p> <p>A great question from @Corentin in the Discord community!</p> <h3 id="q-how-can-i-run-self-hosted-runners-on-an-on-premise-machine-indefinitely" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/851923384613994496" target="_blank" rel="nofollow noopener noreferrer">Q: How can I run self-hosted runners on an on-premise machine indefinitely?</a><a href="#q-how-can-i-run-self-hosted-runners-on-an-on-premise-machine-indefinitely" aria-label="q how can i run self hosted runners on an on premise machine indefinitely permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 
3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can achieve this by passing the <code>--idle-timeout=0</code> option to <code>cml runner</code> in order to disable the timeout.</p> <p>Thanks @achbogga!</p> <h3 id="q-how-can-i-change-the-default-vpc-to-a-different-one-with-cml-runner-for-aws" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/857940793616498738" target="_blank" rel="nofollow noopener noreferrer">Q: How can I change the default VPC to a different one with <code>cml-runner</code> for AWS?</a><a href="#q-how-can-i-change-the-default-vpc-to-a-different-one-with-cml-runner-for-aws" aria-label="q how can i change the default vpc to a different one with cml runner for aws permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Great gem from @krish98409!</p> <p>You could set the security group via <code>cloud-aws-security-group</code>. 
It will pick the VPC that manages that precise security group.</p> <p>We still don't provide a way of specifying VPCs other than the default one, but it's an issue that we're currently working on: <a href="https://github.com/iterative/terraform-provider-iterative/issues/107" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/terraform-provider-iterative/issues/107</a></p> <h3 id="q-is-it-possible-to-rename-and-modify-a-file-inside-a-directory-tracked-by-dvc-in-one-commitchange" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/849589484517588992" target="_blank" rel="nofollow noopener noreferrer">Q: Is it possible to rename and modify a file inside a directory tracked by DVC in one commit/change?</a><a href="#q-is-it-possible-to-rename-and-modify-a-file-inside-a-directory-tracked-by-dvc-in-one-commitchange" aria-label="q is it possible to rename and modify a file inside a directory tracked by dvc in one commitchange permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you modify the name and modify the file, you just need to run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> and then commit the change into Git.</p> <p>This was a good question for everyone. 
Thanks @snowpong!</p> <h3 id="q-how-can-i-list-the-experiments-ive-queued" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/856882434138570753" target="_blank" rel="nofollow noopener noreferrer">Q: How can I list the experiments I've queued?</a><a href="#q-how-can-i-list-the-experiments-ive-queued" aria-label="q how can i list the experiments ive queued permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a great question to help us all understand something so thanks @adwivedi.</p> <p>To look at your queued experiments, run <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>. All of the queued experiments will be marked with an asterisk <code>*</code>.</p> <p><em>Queued experiments are not shown with the <a href="https://dvc.org/doc/command-reference/exp/list"><code>dvc exp list</code></a> command at the moment.</em></p> <h3 id="q-i-have-two-machines-and-a-central-remote-with-my-second-machine-i-want-to-pull-the-dataset-from-the-first-machine-how-can-i-pull-the-data-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/859034882297823233" target="_blank" rel="nofollow noopener noreferrer">Q: I have two machines and a central remote. With my second machine, I want to pull the dataset from the first machine. 
How can I pull the data with DVC?</a><a href="#q-i-have-two-machines-and-a-central-remote-with-my-second-machine-i-want-to-pull-the-dataset-from-the-first-machine-how-can-i-pull-the-data-with-dvc" aria-label="q i have two machines and a central remote with my second machine i want to pull the dataset from the first machine how can i pull the data with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Make sure that you have configured a DVC remote and run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> from your first machine. You should be able to find the files on the remote storage where you pushed them to after running that command. Then you can run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> on your second machine and this should give you the dataset you pushed from the first machine.</p> <p>You will run into some issues if your remote isn't configured properly on the second machine. Check your <code>.dvc/config</code> file for the second machine to make sure there aren't any errors. 
It could be something as simple as a connection string without the necessary quotation marks!</p> <p>Thanks so much for this question @raharth!</p> <h3 id="q-dvc-push-says-everything-is-up-to-date-however-i-modified-my-dataset-and-this-is-confirmed-with-dvc-status-where-it-lists-a-modified-entry-on-the-changed-outs-how-can-i-force-a-push-of-my-changes" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/857931383476977695" target="_blank" rel="nofollow noopener noreferrer">Q: <code>dvc push</code> says, "Everything is up to date." However, I modified my dataset and this is confirmed with <code>dvc status</code>, where it lists a "modified" entry on the changed outs. How can I force a push of my changes?</a><a href="#q-dvc-push-says-everything-is-up-to-date-however-i-modified-my-dataset-and-this-is-confirmed-with-dvc-status-where-it-lists-a-modified-entry-on-the-changed-outs-how-can-i-force-a-push-of-my-changes" aria-label="q dvc push says everything is up to date however i modified my dataset and this is confirmed with dvc status where it lists a modified entry on the changed outs how can i force a push of my changes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You need to run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> to commit your changes to the cache.</p> <p>Good question @BSVogler.</p> <h3 id="q-im-trying-to-use-the-dvc-api-in-a-jupyter-notebook-can-i-simulate-a-dvc-push-command-via-the-api" style="position:relative;"><a 
href="https://discord.com/channels/485586884165107732/485596304961962003/856979475068878898" target="_blank" rel="nofollow noopener noreferrer">Q: I'm trying to use the DVC API in a Jupyter notebook. Can I simulate a <code>dvc push</code> command via the API?</a><a href="#q-im-trying-to-use-the-dvc-api-in-a-jupyter-notebook-can-i-simulate-a-dvc-push-command-via-the-api" aria-label="q im trying to use the dvc api in a jupyter notebook can i simulate a dvc push command via the api permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Nice job working with the Python API @harry134!</p> <p>You can use the <code>Repo</code> API like this.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo

repo <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token punctuation">)</span>
repo<span class="token punctuation">.</span>push<span class="token punctuation">(</span><span class="token punctuation">)</span></code></pre></div> <p>The API isn't production ready, so documentation is lacking at the moment. However, we use it internally all the time, so you can use it too, with caution.</p> <hr> <p><img src="https://media.giphy.com/media/l0Iyl55kTeh71nTXy/giphy.gif" alt="Done GIF by Quizizz"></p> <p>At our August Office Hours Meetup, we'll be learning about DVC and Streamlit integration. 
<a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279723437/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get answers for your DVC and CML questions!</p>https://dvc.org/blog/hyperparam-tuninghttps://dvc.org/blog/hyperparam-tuningMon, 19 Jul 2021 00:00:00 GMT<h2 id="intro" style="position:relative;">Intro<a href="#intro" aria-label="intro permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>When you're starting to build a new machine learning model and you're deciding on the model architecture, there are a number of issues that arise. You have to monitor code changes you make, note any differences in the data you've used for training, and keep up with hyperparameter value updates.</p> <p>Being able to track all of these changes is important so that you can reproduce your experiments without wondering which changes gave you the best model. You can go back to any point in your experimenting process to see which changes gave you the best results.</p> <p>In this post, we're going to go through an example of hyperparameter tuning with reproducibility using DVC. 
You can add this to any existing project you're working on or start from a fresh project.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/W48Tvx2p-xE?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="background-on-hyperparameters" style="position:relative;">Background on Hyperparameters<a href="#background-on-hyperparameters" aria-label="background on hyperparameters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Before jumping straight into training and experiments, let's briefly go over some background on hyperparameters. <a href="https://dvc.org/doc/command-reference/params" target="_blank" rel="nofollow noopener noreferrer">Hyperparameters</a> are the values that define your model. This includes things like the number of layers in a neural network or the learning rate for gradient descent.</p> <p>These parameters are different from model parameters because we can't get them from training our model. They are used to <em>create</em> the model we train with. 
Optimizing these values means running training steps for different kinds of models to see how accurate the results are. We can find the best model by iterating through different hyperparameter values and seeing how they affect our accuracy.</p> <p>That's why we do hyperparameter tuning. There are a couple of common methods that we'll cover with code examples: grid search and random search.</p> <h2 id="tuning-with-dvc" style="position:relative;">Tuning with DVC<a href="#tuning-with-dvc" aria-label="tuning with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Let's start by talking about DVC a bit because we'll be using it to add reproducibility to our tuning process. This is the tool we'll be using to track changes in our data, code, and hyperparameters. With DVC, we can add some automation to the tuning process and be able to find and restore any really good models that emerge.</p> <p>A few things DVC makes easier to do:</p> <ul> <li>Letting you make changes without worrying about finding them later</li> <li>Onboarding other engineers to a project</li> <li>Sharing experiments with other engineers on different machines</li> </ul> <p>For hyperparameter tuning, this means you can experiment with hyperparameter values without losing track of which changes produced the best model, and other engineers can take a look too. 
We'll do an example of this with grid search in DVC first.</p> <h2 id="working-with-a-dvc-project" style="position:relative;">Working with a DVC project<a href="#working-with-a-dvc-project" aria-label="working with a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We're going to be working with an existing NLP project. You can <a href="https://github.com/iterative/example-get-started" target="_blank" rel="nofollow noopener noreferrer">get the code we're working with in this repo</a>. It already has DVC set up, but you can check out <a href="https://dvc.org/doc/start" target="_blank" rel="nofollow noopener noreferrer">the Get Started docs</a> if you want to know how the DVC pipeline was created.</p> <p>First make sure you're in a virtual environment with a command similar to this.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> <span class="token parameter variable">-m</span> venv .venv</span></code></pre></div> <p>After you've cloned the repo, install all of the dependencies with this command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">-r</span> requirements.txt</span></code></pre></div> <p>You should be able to open your terminal and run 
an experiment with the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span></span></code></pre></div> <p>This will trigger the training process to run and it will record the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html" target="_blank" rel="nofollow noopener noreferrer">ROC-AUC</a> of your model. You can check out the results of your experiment with the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-timestamp</span> <span class="token parameter variable">--include-params</span> train.n_est,train.min_split</span></code></pre></div> <p><em>We're adding a few options here to make the table view clearer. We aren't showing timestamps and we're only looking at two hyperparameter values. 
You can run <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a> without the options to see the entire table.</em></p> <p>This will produce a table similar to this.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> </span> ────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.51682<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.93819<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>175<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token 
hide">**</span></span> <span class="token bold"><span class="token hide">**</span>master<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.56447<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.94713<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> └── a1e8716 [exp-09074] 0.57333 0.94801 100 32 </span> ──────────────────────────────────────────────────────────────────────────────</code></pre></div> <h3 id="start-tuning-with-grid-search" style="position:relative;">Start tuning with grid search<a href="#start-tuning-with-grid-search" aria-label="start tuning with grid search permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <admon type="tip"> <p>Starting with DVC <code>2.25.0</code>, you can perform a grid search directly using <code>exp run --set-param</code>. See the <a href="https://dvc.org/doc/command-reference/exp/run#example-grid-search" target="_blank" rel="nofollow noopener noreferrer">example in the command reference</a>.</p> </admon> <p>Now that you've seen how to run an experiment, we're going to write a small script to automate grid search for us using DVC. Using grid search in hyperparameter tuning means you have an exhaustive list of hyperparameter values you want to cycle through. 
Grid search will cover every combination of those hyperparameter values.</p> <p>We'll do this by creating queues. A queue is how DVC allows us to create experiments that won't be run until later. That way we can cycle through multiple hyperparameters quickly instead of manually updating a config file with new hyperparameter values for each experiment run. The command syntax for creating queues looks like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--queue</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">train.min_split</span><span class="token operator">=</span><span class="token number">8</span></span></code></pre></div> <p>In the example queue above, we're updating the <code>train.min_split</code> value that's inside of the <code>params.yaml</code> file. This file holds all of the hyperparameter values and is where DVC looks to determine if any values have changed. With the command above, we're automatically updating that value in the <code>params.yaml</code> using a queued experiment.</p> <p>Now we can make the script. You can add a new file to the <code>src</code> directory called <code>grid_search.py</code>. 
Inside the file, add the following code.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> itertools
<span class="token keyword">import</span> subprocess

<span class="token comment"># Automated grid search experiments</span>
n_est_values <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">250</span><span class="token punctuation">,</span> <span class="token number">300</span><span class="token punctuation">,</span> <span class="token number">350</span><span class="token punctuation">,</span> <span class="token number">400</span><span class="token punctuation">,</span> <span class="token number">450</span><span class="token punctuation">,</span> <span class="token number">500</span><span class="token punctuation">]</span>
min_split_values <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">8</span><span class="token punctuation">,</span> <span class="token number">16</span><span class="token punctuation">,</span> <span class="token number">32</span><span class="token punctuation">,</span> <span class="token number">64</span><span class="token punctuation">,</span> <span class="token number">128</span><span class="token punctuation">,</span> <span class="token number">256</span><span class="token punctuation">]</span>

<span class="token comment"># Iterate over all combinations of hyperparameter values.</span>
<span class="token keyword">for</span> n_est<span class="token punctuation">,</span> min_split <span class="token keyword">in</span> itertools<span class="token punctuation">.</span>product<span class="token punctuation">(</span>n_est_values<span class="token punctuation">,</span> min_split_values<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Execute "dvc exp run --queue --set-param train.n_est=&lt;n_est&gt; --set-param train.min_split=&lt;min_split&gt;".</span>
    subprocess<span class="token punctuation">.</span>run<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"dvc"</span><span class="token punctuation">,</span> <span class="token string">"exp"</span><span class="token punctuation">,</span> <span class="token string">"run"</span><span class="token punctuation">,</span> <span class="token string">"--queue"</span><span class="token punctuation">,</span>
                    <span class="token string">"--set-param"</span><span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"train.n_est=</span><span class="token interpolation"><span class="token punctuation">{</span>n_est<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span>
                    <span class="token string">"--set-param"</span><span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"train.min_split=</span><span class="token interpolation"><span class="token punctuation">{</span>min_split<span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div> <p>This is a simple grid search. We have two hyperparameters we want to tune: <code>n_est</code> and <code>min_split</code>. So we have lists with a few values each to mimic the exhaustive search a grid search performs. 
Then we loop through the values and create queued experiments for them using <code>subprocess</code>.</p> <p>You can run this script now and generate your queue with this command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> src/grid_search.py</span></code></pre></div> <p>You'll see some outputs in the terminal telling you that your experiments have been queued. Then you can run them all with the following command.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--run-all</span></span></code></pre></div> <p>This will run every experiment that has been queued. Once all of those have run, take a look at your metrics for each experiment.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--include-params</span><span class="token operator">=</span>train.min_split,train.n_est <span class="token parameter variable">--no-timestamp</span></span></code></pre></div> <p>Your table should look similar to this when you run the command above. 
We've included the <code>--include-params</code> and <code>--no-timestamp</code> options to give us a table that's easier to read.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> </span> ────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.67038<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.96693<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span> <span class="token bold"><span class="token 
hide">**</span>try-large-dataset<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.67038<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.96693<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span> ├── 4899d41 [exp-ae5ed] 0.6888 0.97028 8 250 ├── bcdd8ed [exp-56613] 0.68733 0.96773 16 250 ├── 703f20b [exp-caa84] 0.68942 0.9698 32 250 ├── 1a882e6 [exp-c208f] 0.681 0.96772 64 250 ├── 3ac33fb [exp-4c53e] 0.67775 0.96664 128 250 ├── ea90ee0 [exp-fdb47] 0.65382 0.96719 256 250 ├── b8277b1 [exp-3fb5c] 0.68547 0.97011 8 300 ├── 7be641e [exp-3bbbc] 0.6883 0.96724 16 300 ├── 4202757 [exp-38ca4] 0.68808 0.96968 32 300 ├── b71ee2f [exp-5384b] 0.68111 0.96848 64 300 ├── 1bbb0f4 [exp-f5d54] 0.67707 0.96753 128 300 ├── 71ba159 [exp-31749] 0.65282 0.96752 256 300 ├── 836c1c5 [exp-2ce0a] 0.68758 0.96998 8 350 ├── dac9e22 [exp-5c799] 0.68778 0.96779 16 350</span></code></pre></div> <p>Now you can see how your precision changed with each hyperparameter value update. This is a quick implementation of grid search in DVC. You could read the hyperparameter values from a different file or data source or make this tuning script as fancy as you like. 
The main thing you need is the <a href="https://dvc.org/doc/command-reference/exp/run#--queue"><code>dvc exp run --queue --set-param &lt;param&gt;</code></a> command to execute when you add new values.</p> <h3 id="random-search" style="position:relative;">Random search<a href="#random-search" aria-label="random search permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Another commonly used method for tuning hyperparameters is random search. This takes random values for hyperparameters and builds the model with them. It usually takes less time than an exhaustive grid search and it can perform better if run for a similar amount of time as a grid search.</p> <p>We're going to add an example of random search in a new file called <code>random_search.py</code>, similar to the file we created for grid search. This will add queued experiments with the randomly selected hyperparameter values. 
Add the following code to <code>random_search.py</code>.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> subprocess
<span class="token keyword">import</span> random

<span class="token comment"># Automated random search experiments</span>
num_exps <span class="token operator">=</span> <span class="token number">10</span>
random<span class="token punctuation">.</span>seed<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>

<span class="token keyword">for</span> _ <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>num_exps<span class="token punctuation">)</span><span class="token punctuation">:</span>
    params <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">"rand_n_est_value"</span><span class="token punctuation">:</span> random<span class="token punctuation">.</span>randint<span class="token punctuation">(</span><span class="token number">250</span><span class="token punctuation">,</span> <span class="token number">500</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
        <span class="token string">"rand_min_split_value"</span><span class="token punctuation">:</span> random<span class="token punctuation">.</span>choice<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">8</span><span class="token punctuation">,</span> <span class="token number">16</span><span class="token punctuation">,</span> <span class="token number">32</span><span class="token punctuation">,</span> <span class="token number">64</span><span class="token punctuation">,</span> <span class="token number">128</span><span class="token punctuation">,</span> <span class="token number">256</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
    <span class="token punctuation">}</span>
    subprocess<span class="token punctuation">.</span>run<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token string">"dvc"</span><span class="token punctuation">,</span> <span class="token string">"exp"</span><span class="token punctuation">,</span> <span class="token string">"run"</span><span class="token punctuation">,</span> <span class="token string">"--queue"</span><span class="token punctuation">,</span>
                    <span class="token string">"--set-param"</span><span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"train.n_est=</span><span class="token interpolation"><span class="token punctuation">{</span>params<span class="token punctuation">[</span><span class="token string">'rand_n_est_value'</span><span class="token punctuation">]</span><span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">,</span>
                    <span class="token string">"--set-param"</span><span class="token punctuation">,</span> <span class="token string-interpolation"><span class="token string">f"train.min_split=</span><span class="token interpolation"><span class="token punctuation">{</span>params<span class="token punctuation">[</span><span class="token string">'rand_min_split_value'</span><span class="token punctuation">]</span><span class="token punctuation">}</span></span><span class="token string">"</span></span><span class="token punctuation">]</span><span class="token punctuation">)</span></code></pre></div> <p>This search could be far more complex with Bayesian optimization to handle the hyperparameter value selections, but we're keeping it super simple by choosing random numbers to focus on reproducibility. 
This will generate ten experiments with random values for each hyperparameter.</p> <p>You can run these new experiments with <a href="https://dvc.org/doc/command-reference/exp/run#--run-all"><code>dvc exp run --run-all</code></a> and then take a look at the results with <a href="https://dvc.org/doc/command-reference/exp/show#--include-params"><code>dvc exp show --include-params=train.min_split,train.n_est --no-timestamp</code></a>. Your table should look something like this.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bg-white"><span class="token hide">neutral:</span><span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span></span> <span class="token bg-yellow"><span class="token hide">metric:</span><span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span></span> <span class="token bg-blue"><span class="token hide">param:</span><span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span></span> </span> ────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>workspace<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.67038<span class="token hide">**</span></span> <span class="token bold"><span 
class="token hide">**</span>0.96693<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>try-large-dataset<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.67038<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>0.96693<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>64<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>100<span class="token hide">**</span></span> ├── fc28c0c [exp-45902] 0.68358 0.96956 64 466 ├── f13ac72 [exp-b9dfa] 0.68275 0.96914 64 444 ├── a8cbc8f [exp-b0aeb] 0.68989 0.97003 32 260 ├── 4791c52 [exp-5f2b5] 0.67711 0.96809 128 497 ├── c5398e0 [exp-86c74] 0.6811 0.96829 64 374 ├── db16c91 [exp-db50f] 0.68986 0.97073 32 485 ├── 2dd08fa [exp-fee4f] 0.68262 0.96941 64 497 ├── 18d2ec5 [exp-d73c7] 0.67696 0.96726 128 341 ├── 1710032 [exp-dd198] 0.68756 0.9687 16 478 ├── 4f0b80a [exp-746c1] 0.68724 0.96811 16 379</span></code></pre></div> <p>This shows the difference in the randomly selected values and the values from grid search. 
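</p> <p>To make that contrast concrete, here is a minimal sketch that builds both sets of parameters without running any experiments. The grid values (<code>GRID_N_EST</code>, <code>GRID_MIN_SPLIT</code>) and the helper names are hypothetical and chosen only for illustration; the random ranges mirror the ones in <code>random_search.py</code>.</p>

```python
import itertools
import random

# Hypothetical grid, for illustration only; not the grid used earlier in the post.
GRID_N_EST = [250, 375, 500]
GRID_MIN_SPLIT = [8, 64, 256]

# Same candidate values as random_search.py.
MIN_SPLIT_CHOICES = [8, 16, 32, 64, 128, 256]


def grid_params():
    """Return every combination of the fixed grid values."""
    return [
        {"train.n_est": n_est, "train.min_split": min_split}
        for n_est, min_split in itertools.product(GRID_N_EST, GRID_MIN_SPLIT)
    ]


def sample_random_params(num_exps, seed=0):
    """Draw num_exps parameter sets from the same ranges as random_search.py."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return [
        {
            "train.n_est": rng.randint(250, 500),
            "train.min_split": rng.choice(MIN_SPLIT_CHOICES),
        }
        for _ in range(num_exps)
    ]


def distinct(params, key):
    """Return the sorted distinct values tried for one hyperparameter."""
    return sorted({p[key] for p in params})


if __name__ == "__main__":
    runs = [("grid", grid_params()), ("random", sample_random_params(10))]
    for name, params in runs:
        values = distinct(params, "train.n_est")
        print(name, len(params), "runs, n_est values tried:", values)
```

<p>Under these assumptions, the nine grid runs only ever try three values of <code>train.n_est</code>, while ten random runs will typically try close to ten different ones.</p> <p>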
You might find a better value with random search because it samples from across the whole range of values, so it can land near the optimum sooner than a grid search that only tries a fixed set of points.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>With the comparison between grid search and random search, you can see how reproducibility can help you find the best model for your project. You'll be able to see all of the hyperparameter changes and code changes that created each model. This gives you the ability to fine-tune your model because you can go to any experiment and resume training with different values, code, or data.</p>https://dvc.org/blog/july-21-dvc-heartbeathttps://dvc.org/blog/july-21-dvc-heartbeatFri, 16 Jul 2021 00:00:00 GMT<h1 id="welcome-to-summer" style="position:relative;">Welcome to Summer!<a href="#welcome-to-summer" aria-label="welcome to summer permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><img src="https://media.giphy.com/media/WuY9yfI89DbNu/giphy.gif" alt="It's summer!"></p> <h1 id="from-the-community" style="position:relative;">From the 
Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><span style="color:purple"><strong>A</strong></span>s usual we have a ton of goodness from the Community! Let's jump in!</p> <h2 id="antoine-toubhans-post-combining-streamlit-and-dvc" style="position:relative;">Antoine Toubhans' Post Combining Streamlit and DVC!<a href="#antoine-toubhans-post-combining-streamlit-and-dvc" aria-label="antoine toubhans post combining streamlit and dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/antoine-toubhans-92262119/" target="_blank" rel="nofollow noopener noreferrer">Antoine Toubhans</a> of <a href="https://www.sicara.fr/" target="_blank" rel="nofollow noopener noreferrer">Sicara</a> wrote a fantastic and detailed tutorial entitled <a href="https://www.sicara.ai/blog/dvc-streamlit-webui-ml" target="_blank" rel="nofollow noopener noreferrer"><strong>How to Build Customizable Web UI for Machine Learning with Streamlit and DVC</strong></a> bringing together the best of DVC and integrating it with Streamlit to provide a customizable UI. 
The tutorial <span style="color:purple"><strong>g</strong></span>oes through the steps of setting up a pipeline, splitting a dataset, training and evaluating a model, tracking changes to data and model, DVC <span style="color:purple"><strong>m</strong></span>etrics and plots, and then bridging the gap in visualizations using <span style="color:purple"><strong>S</strong></span>treamlit. You won't want to miss this one!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ca5a6eaba77575617f7935269fbefd1e/39600/streamlit2.png" alt="DVC and Streamlit" title="DVC and Streamlit" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC + Streamlit = ♥️! <a href="https://www.sicara.ai/blog/dvc-streamlit-webui-ml" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h2 id="dvc-and-cml-in-japanese" style="position:relative;">DVC and CML in Japanese!<a href="#dvc-and-cml-in-japanese" aria-label="dvc and cml in japanese permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>For our friends who speak Japanese, <a href="https://www.slideshare.net/yusukeshibui/testing-machine-learningdevelopment" target="_blank" rel="nofollow noopener noreferrer">these slides</a> created by <a href="https://www.slideshare.net/yusukeshibui?utm_campaign=profiletracking&utm_medium=sssite&utm_source=ssslideview" target="_blank" rel="nofollow noopener noreferrer">Yusuke Shibui</a> 
walk you through a machine-learning-to-production project using D<span style="color:purple"><strong>V</strong></span>C and C<span style="color:purple"><strong>M</strong></span>L. We love seeing our tools being used all around the world! 🌏</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/32aab98c003f28761779d2fb79c247af/39600/in-japanese.png" alt="DVC and CML in Japanese" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DVC and CML in Japanese! <a href="https://www.slideshare.net/yusukeshibui/testing-machine-learningdevelopment" target="_blank" rel="nofollow noopener noreferrer">Source link</a></em></p> <h2 id="miguel-méndez-dvc-tutorial" style="position:relative;">Miguel Méndez' DVC Tutorial<a href="#miguel-m%C3%A9ndez-dvc-tutorial" aria-label="miguel méndez dvc tutorial permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/miguel-mendez/" target="_blank" rel="nofollow noopener noreferrer">Miguel Méndez</a> and his team at <a href="https://www.gradiant.org/en/" target="_blank" rel="nofollow noopener noreferrer">Gradiant</a> <span style="color:purple"><strong>s</strong></span>truggled with reproducibility before using DVC for versioning their image dataset and annotations. The dataset and annotations are held in a shared storage space and used by the whole team. 
DVC enables the team to track changes and know what versions of the dataset produce the best results. His tutorial walks you through the steps to set it up!</p> <p> </p><section class="elp-content-holder"> <a href="https://mmeendez8.github.io/2021/07/01/dvc-tutorial.html" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Version Control Your Dataset with DVC</h4> <div class="elp-description">Miguel Méndez' tutorial on using DVC for versioning datasets and providing reproducibility</div> <div class="elp-link">https://github.io</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-07-16/git-dvc-77f8c394ced19aec2e78228f20003fd6.png" alt="Version Control Your Dataset with DVC"> </div> </a> </section> <p></p> <h2 id="jobs-requiring-dvc" style="position:relative;">Jobs requiring DVC!<a href="#jobs-requiring-dvc" aria-label="jobs requiring dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have been seeing an uptick in the number of jobs requiring knowledge of DVC. It's exciting to see that our tools are helpin<span style="color:purple"><strong>g</strong></span> these companies in their MLOps workflows! 
🎉</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/58da1cf2344bc6fbbc522a2e67842e03/39600/job-descriptions.png" alt="job descriptions" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h1 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>With all those DVC job opportunities out there, you <span style="color:purple"><strong>b</strong></span>etter get on it! 
😉</p> <h2 id="a-new-udacity-course-incorporating-dvc" style="position:relative;">A New Udacity Course Incorporating DVC!<a href="#a-new-udacity-course-incorporating-dvc" aria-label="a new udacity course incorporating dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Just this month, a new <a href="https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821" target="_blank" rel="nofollow noopener noreferrer">Udacity</a> nanodegree program came out entitled <a href="https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821" target="_blank" rel="nofollow noopener noreferrer"><strong>Machine Learning DevOps Engineer</strong></a> that teaches DVC as part of the program. 
This course includes sections on:</p> <ul> <li>Clean Code Principles</li> <li>Building a Reproducible <span style="color:purple"><strong>M</strong></span>odel Workflow</li> <li>Deploying a Scalable ML Pipeline in Production</li> <li>Automated Model Scoring and Monitoring</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Machine Learning DevOps Engineer</h4> <div class="elp-description">A new nanodegree program offered by Udacity teaching DVC as part of the curriculum</div> <div class="elp-link">https://udacity.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-07-16/udacity-26a2fd44369db81a4577a44b318f8559.png" alt="Machine Learning DevOps Engineer"> </div> </a> </section> <p></p> <h2 id="dvc-leaspan-stylecolorpurplerspann" style="position:relative;">DVC Lea<span style="color:purple"><strong>r</strong></span>n<a href="#dvc-leaspan-stylecolorpurplerspann" aria-label="dvc leaspan stylecolorpurplerspann permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This week we kicked off our new DVC Learn Meetup series with <a href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a>. This set of three short, half-hour classes is designed to get you up and running with DVC. 
If you are just getting started with <span style="color:purple"><strong>D</strong></span>VC or kicking the tires, this Meetup series is for you! Our next class on August 4th will get you started with experiments.</p> <p>If you are interested in weighing in on what kinds of educational content you would like to see from us, we'd be grateful if you'd fill out <a href="https://docs.google.com/forms/d/e/1FAIpQLSdmwjs0ZkxDdODfZTvSwP2bVW4JAVVdxiYhQPyW5dSbsZC8qg/viewform?pli=1" target="_blank" rel="nofollow noopener noreferrer"><strong>this survey</strong></a> to help us plan! 🙏🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279447414/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVC Learn - Getting Started: Experiments</h4> <div class="elp-description">The next DVC Learn Meetup, taught by Milecia McGregor, designed to get you started with DVC Experiments</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-07-16/dvc_learn-415db7d91a0670061d51698d6880fc57.png" alt="DVC Learn - Getting Started: Experiments"> </div> </a> </section> <p></p> <h2 id="data-science-journal-article-on-reproducibility-practices-in-research" style="position:relative;">Data Science Journal Article on Reproducibility Practices in Research<a href="#data-science-journal-article-on-reproducibility-practices-in-research" aria-label="data science journal article on reproducibility practices in research permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 
11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>New research presented in the <a href="https://datascience.codata.org/" target="_blank" rel="nofollow noopener noreferrer">Data Science Journal</a> aims to establish best practices for reproducibility in research datasets. This is necessary to pinpoint the version of the dataset that grounds any research. In this work, the authors reviewed 39 use cases from 33 organizations to arrive at six principles for versioning datasets: <strong>Revision</strong>, <strong>Release</strong>, <strong>Granularity</strong>, <strong>Manifestation</strong>, <span style="color:purple"><strong>P</strong></span><strong>rovenance</strong> and <strong>Citation</strong>. See the full work below. 👇🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://datascience.codata.org/articles/10.5334/dsj-2021-012/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Versioning Data is About More Than Revisions: A Conceptual Framework and Proposed Principles</h4> <div class="elp-description">Authors analyze 39 use cases in 33 organizations to arrive at proposed principles when versioning data.</div> <div class="elp-link">https://datascience.codata.org</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-07-16/dsj-7d89083918a0e490fd9e1511e6ad40bc.png" alt="Versioning Data is About More Than Revisions: A Conceptual Framework and Proposed Principles"> </div> </a> </section> <p></p> <h2 id="june-office-hours-meetup" style="position:relative;">June Office Hours Meetup<a href="#june-office-hours-meetup" aria-label="june office hours meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 
1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The June Office Hours Meetup was 🔥! Amazing discussion on experiments ignited by <a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer">Sami Jawhar</a> of <a href="https://www.kernel.com/" target="_blank" rel="nofollow noopener noreferrer">Kernel</a> around experiment use cases and workflows.<br> You can <a href="https://github.com/sjawhar/dvc-cloud-runner" target="_blank" rel="nofollow noopener noreferrer">find the repo for his presentation here</a> and watch all the great DVC discussion below.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/DxZdWq3Weng?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><span 
style="color:purple"><strong>S</strong></span>ummer and vaccinations mean travel! ☀️💉 And that travel has enabled some of our team members to get together! Pictured below are Dmitry Petrov, Alexander Guschin, Max Shmakov, Mikhail Rozhkov, Sergey Kryukov, Mikhail Sveshnikov, and Guro Bokum… But not necessarily in that order.</p> <p>The first person to guess the correct order of our teammates starting from the upper right of the picture moving clockwise, <strong>and</strong> post in the corresponding <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> Heartbeat post, will win some DVC SWAG! Hint: If you've been wondering why there are random purple letters in this blog post, they're a clue to this cipher. 🧐</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 661px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/7c5e525a8aaaf6a2e6a3cb01591fc88a/d5cf8/team.png" alt="team" title="team" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Team Meetup in Moscow! (hand signals obscured for our UK friends, because we care! 
🤗)</em></p> <h2 id="new-team-member" style="position:relative;">New Team Member<a href="#new-team-member" aria-label="new team member permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/david-de-la-iglesia-castro-b4b67b20a/" target="_blank" rel="nofollow noopener noreferrer">David de la Iglesia Castro</a> is the third teammate joining us from Spain! 🇪🇸 And also the third David! He hails from Galicia and has been an active member of our Community for over two years. We are so excited to have him join the team as a software engineer, where he will wor<span style="color:purple"><strong>k</strong></span> to improve DVC Live. When he's not contributing to DVC, David likes to go climbing, surfing or just hiking whenever he can! Welcome, David!</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>And yes indeed, we are still hiring! 
<a href="https://www.notion.so/iterative/iterative-ai-is-hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions, including:</p> <ul> <li>Senior Front-End Engineer (TypeScript, Node, React)</li> <li>Senior Software Engineer (ML, Dev Tools, Python)</li> <li>Senior Software Engineer (ML, Data Infra, GoLang)</li> <li>Machine Learning Engineer/Field Data Scientist</li> <li>Developer Advocate (ML)</li> <li>Director/VP of Engineering (ML, DevTools)</li> <li>Director/VP of Product (ML, Data Infra, SaaS)</li> <li>Director/VP of Operations/Chief of Staff</li> </ul> <p>Please pass this info on to anyone you know who may fit the bill. We look forward to new team members! 🎉</p> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Don't miss our <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279024694/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a> on July 28th at 2:00 pm UTC (7:00 am PDT), where <a href="https://www.linkedin.com/in/jcpsantiago/" target="_blank" rel="nofollow noopener noreferrer">João Santiago</a> of <a href="https://www.billie.io/" target="_blank" rel="nofollow noopener noreferrer">Billie</a> will present "DVThis", a set of utility functions for DVC pipelines using R scripts. 
Additionally the project aims to document the usual workflows of a DVC pipeline using these scripts and create templates for the use of DVC and R together.</p> <p>Following Santiago, team member <a href="https://www.linkedin.com/in/tapa-dipti-sitaula/" target="_blank" rel="nofollow noopener noreferrer">Tapa Dipti Sitaula</a> will give a demo of DVC Studio! Bring your questions; we look forward to seeing you!</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279024694/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVThis</h4> <div class="elp-description">July DVC Office Hours with João Santiago of Billie shows us how to use R with DVC, presenting DVThis and Tapa Dipti Sitaula shares a demo of DVC Studio.</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-07-16/office-hours-meetup-4d64171025fb882a3b68512f807f2d53.png" alt="DVThis"> </div> </a> </section> <p></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Fantastically detailed tutorial from <a href="https://twitter.com/AntoineToubhans">@AntoineToubhans</a> on how to build a customizable web UI for <a 
href="https://twitter.com/hashtag/MachineLearning?src=hash&ref_src=twsrc%5Etfw">#MachineLearning</a> with <a href="https://twitter.com/streamlit">@Streamlit</a> and <a href="https://twitter.com/DVCorg">@DVCorg</a>! 🐍🎈<a href="https://t.co/zrZCueWk0n">https://t.co/zrZCueWk0n</a></p>— Charly Wargnier (@DataChaz) <a href="https://twitter.com/DataChaz/status/1410319379837894656">June 30, 2021</a></blockquote> <hr> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/june-21-community-gemshttps://dvc.org/blog/june-21-community-gemsWed, 30 Jun 2021 00:00:00 GMT<h3 id="q-is-it-possible-to-plot-multiple-experiments-together" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/834387923482181653" target="_blank" rel="nofollow noopener noreferrer">Q: Is it possible to plot multiple experiments together?</a><a href="#q-is-it-possible-to-plot-multiple-experiments-together" aria-label="q is it possible to plot multiple experiments together permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can use experiment names in the <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots</code></a> commands. You need to use the <code>diff</code> command to compare multiple plots. 
Try <a href="https://dvc.org/doc/command-reference/plots/diff#-"><code>dvc plots diff exp-6ef18 exp-b17b4 exp-26e88</code></a>.</p> <p>Thanks to @PythonF from Discord for asking this question that led to this Gem! 💎</p> <h3 id="q-where-is-the-list-of-experiment-being-pushed-in-git-when-i-run-dvc-exp-push" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/837773937390649364" target="_blank" rel="nofollow noopener noreferrer">Q: Where is the list of experiment being pushed in Git when I run <code>dvc exp push</code>?</a><a href="#q-where-is-the-list-of-experiment-being-pushed-in-git-when-i-run-dvc-exp-push" aria-label="q where is the list of experiment being pushed in git when i run dvc exp push permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>It uses custom Git refs internally, similar to the way GitHub handles PRs. It’s a custom DVC Git ref pointing to a Git commit. 
Here's an example.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref exp-26220 </span>c42f48168830148b946f6a75d1bdbb25cda46f35 refs/exps/f1/37703af59ba1b80e77505a762335805d05d212/exp-26220</code></pre></div> <p>If you want to see your local experiments (that have not been pushed), you can run <a href="https://dvc.org/doc/command-reference/exp/list#--all"><code>dvc exp list --all</code></a>.</p> <p>You can read more about how we handle our custom Git refs in <a href="https://dvc.org/blog/experiment-refs" target="_blank" rel="nofollow noopener noreferrer">this blog post</a>.</p> <p>Thanks to @Chandana for asking this question about experiments!</p> <h3 id="q-is-there-a-way-to-list-all-the-experiments-i-have-on-my-dvc-remote-that-have-not-been-committed-to-git" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/836705209039978538" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to list all the experiments I have on my DVC remote that have not been committed to Git?</a><a href="#q-is-there-a-way-to-list-all-the-experiments-i-have-on-my-dvc-remote-that-have-not-been-committed-to-git" aria-label="q is there a way to list all the experiments i have on my dvc remote that have not been committed to git permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes! 
You can quickly look at all of the experiments in any repo with:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp list</span> <span class="token parameter variable">--all</span> <span class="token operator"><</span>git repo URL<span class="token operator">></span></span></code></pre></div> <p>or</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp list</span> <span class="token parameter variable">--all</span> <span class="token operator"><</span>git remote<span class="token operator">></span></span></code></pre></div> <p>Thanks again @Chandana for this gem!</p> <h3 id="q-is-cml-compatible-with-azure-devops" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/841664412221177926" target="_blank" rel="nofollow noopener noreferrer">Q: Is CML compatible with Azure DevOps?</a><a href="#q-is-cml-compatible-with-azure-devops" aria-label="q is cml compatible with azure devops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Another great question from @Chandana!</p> <p>Right now, we support GitHub and GitLab.</p> <p>Azure DevOps and GCP (Google Cloud Platform) support are on the roadmap. 
Stay tuned for more updates!</p> <p>You can stay up to date with our Azure DevOps progress on <a href="https://github.com/iterative/cml/issues/142" target="_blank" rel="nofollow noopener noreferrer">this issue on GitHub</a>. You can also follow along with GCP updates with <a href="https://github.com/iterative/terraform-provider-iterative/issues/64" target="_blank" rel="nofollow noopener noreferrer">this issue</a>.</p> <h3 id="q-i-pushed-a-lot-of-files-using-dvc-push-to-my-dvc-remote-but-there-are-a-few-that-couldnt-be-pushed-at-the-time-if-i-run-dvc-push-again-will-it-just-upload-the-missing-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/842662337159757854" target="_blank" rel="nofollow noopener noreferrer">Q: I pushed a lot of files using <code>dvc push</code> to my DVC remote, but there are a few that couldn't be pushed at the time. If I run <code>dvc push</code> again, will it just upload the missing files?</a><a href="#q-i-pushed-a-lot-of-files-using-dvc-push-to-my-dvc-remote-but-there-are-a-few-that-couldnt-be-pushed-at-the-time-if-i-run-dvc-push-again-will-it-just-upload-the-missing-files" aria-label="q i pushed a lot of files using dvc push to my dvc remote but there are a few that couldnt be pushed at the time if i run dvc push again will it just upload the missing files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the question @petek!</p> <p>Yes! 
You can just re-run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> and it will only upload the missing files.</p> <p>It might be a little slower than you would expect because DVC has to do some checks to make sure that the other files were uploaded successfully before, but as far as the actual data transfer goes, only the missing files will be uploaded.</p> <h3 id="q-lets-say-i-have-a-dvc-pipeline-with-two-stages-can-i-only-pull-the-second-one-and-keep-the-first-one-for-other-uses-can-i-pull-some-specific-output-from-the-pipeline" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/841688323663855616" target="_blank" rel="nofollow noopener noreferrer">Q: Let's say I have a DVC pipeline with two stages, can I only pull the second one and keep the first one for other uses? Can I pull some specific output from the pipeline?</a><a href="#q-lets-say-i-have-a-dvc-pipeline-with-two-stages-can-i-only-pull-the-second-one-and-keep-the-first-one-for-other-uses-can-i-pull-some-specific-output-from-the-pipeline" aria-label="q lets say i have a dvc pipeline with two stages can i only pull the second one and keep the first one for other uses can i pull some specific output from the pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can pull specific outputs from a pipeline with <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull path/to/specific/output</code></a>. 
This is similar to how you can use <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> to work with specific files and directories.</p> <p>Thanks for such a great question @LucZ!</p> <h3 id="q-how-does-dvc-handle-incremental-changes-in-the-data-and-how-does-it-work-with-non-dvc-based-pipeline-features" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/846364469524430848" target="_blank" rel="nofollow noopener noreferrer">Q: How does DVC handle incremental changes in the data and how does it work with non-DVC based pipeline features?</a><a href="#q-how-does-dvc-handle-incremental-changes-in-the-data-and-how-does-it-work-with-non-dvc-based-pipeline-features" aria-label="q how does dvc handle incremental changes in the data and how does it work with non dvc based pipeline features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>These are good questions from @Phoenix about common problems in MLOps!</p> <p>To answer the first part, say you are getting new data every week. When you use DVC, you don't have to worry about storing duplicate data.</p> <p>DVC supports file-level deduplication right now, so if your data is a directory of files, each unique file is stored only once. Chunk-level deduplication is on our to-do list. 
You can see how it's going in <a href="https://github.com/iterative/dvc/issues/829" target="_blank" rel="nofollow noopener noreferrer">this issue we have on GitHub</a>.</p> <p>For the second part of the question, you can use data management with DVC and have your own pipelines. Just treat it as Git for data then be sure to <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> and you should be set. Hooks, like <code>pre-commit</code> or <code>post-pipeline-run</code>, are a good way to go about it.</p> <h3 id="q-is-there-a-way-to-tell-dvc-to-use-a-different-profile-instead-of-the-default-profile-when-interacting-with-s3" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/846857498094469120" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to tell DVC to use a different profile instead of the default profile when interacting with S3?</a><a href="#q-is-there-a-way-to-tell-dvc-to-use-a-different-profile-instead-of-the-default-profile-when-interacting-with-s3" aria-label="q is there a way to tell dvc to use a different profile instead of the default profile when interacting with s3 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>When you have a remote that is not on your default AWS profile and when you access it via the <code>awscli</code> using something like <code>aws s3 --profile=second_profile 
ls</code>, you'll need to update your remote config in DVC.</p> <p>You can run a command like:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote profile myprofile</span></code></pre></div> <p>Check out the docs on <a href="https://dvc.org/doc/command-reference/remote/modify"><code>dvc remote modify</code></a> for all the remote config options.</p> <p>Great question @Avi!</p> <hr> <p><img src="https://media.giphy.com/media/l0IycQmt79g9XzOWQ/giphy.gif" alt="Shut It Down GIF by Matt Cutshall"></p> <p>At our July Office Hours Meetup we will be demo-ing pipelines as well as CML. <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/279024694/" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/june-21-dvc-heartbeathttps://dvc.org/blog/june-21-dvc-heartbeatFri, 18 Jun 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>This month I'm going to take you on a thought provoking journey through some of the content from our community.</p> <p><img 
src="https://media.giphy.com/media/Uni2jYCihB3fG/giphy.gif" alt="So many choices..."></p> <h2 id="lj-mirandas-triad-of-order" style="position:relative;">LJ Miranda's Triad of order<a href="#lj-mirandas-triad-of-order" aria-label="lj mirandas triad of order permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The MLOps tool landscape can be confusing, to say the least.<br> <a href="https://twitter.com/ljvmiranda921" target="_blank" rel="nofollow noopener noreferrer">LJ Miranda</a>, in a well-written <a href="https://ljvmiranda921.github.io/notebook/2021/05/10/navigating-the-mlops-landscape/" target="_blank" rel="nofollow noopener noreferrer">three-part series</a>, lays out a framework for making sense of this space. The list of tools is not exhaustive, but the framework and thought process for evaluating the tools are intriguing. Additionally, he encourages thinking about the skillset of the members of your team within this framework to help you decide on the right tools. It's not just about the tools, it's about the people!</p> <p>As you can see, DVC makes it into the "Trial" loop, but we think we will be making it into the adoption region in relatively short order. 
😉🚀</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 675px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e9aa3f0108be0f8003703fa2dce42573/39600/LJMiranda.png" alt="LJMiranda" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Making sense of the MLOps Landscape</em></p> <h2 id="found-in-the-mlops-community" style="position:relative;">Found in the MLOps Community<a href="#found-in-the-mlops-community" aria-label="found in the mlops community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>You can find more comments from LJ Miranda and others in response to a <a href="https://mlops-community.slack.com/?redir=%2Farchives%2FC015J2Y9RLM%2Fp1622714574054300" target="_blank" rel="nofollow noopener noreferrer">great question</a> from André Godinho in the <a href="https://mlops.community/" target="_blank" rel="nofollow noopener noreferrer">MLOps Community</a> Slack (see below). If you're into MLOps and you're NOT a part of this Community, you should be. You can join their Slack <a href="https://mlops-community.slack.com/join/shared_invite/zt-o96abp9z-sRYKWb96wGK9vdhUvbSrsQ#/shared-invite/email" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <blockquote> <p>I have recently came across with DVC by listening to MLOps Coffee Sessions #6 with David Aponte and Elle O'Brien (Such an interesting talk! 💯). 
This tool integrates smoothly with Git, tracks models & datasets, and also has an online UI DVC Studio 🚀. Is there any use case of MLflow that DVC can't handle? I find DVC to give more rise to creativity as it integrates really well with Git. - André Godinho</p> </blockquote> <h2 id="neda-sultovas-tutorial-and-tool-rubric" style="position:relative;">Neda Sultova's Tutorial and Tool Rubric<a href="#neda-sultovas-tutorial-and-tool-rubric" aria-label="neda sultovas tutorial and tool rubric permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Drilling down to the next level, I give you <a href="https://medium.com/geekculture/exploring-dvc-for-machine-learning-pipelines-in-research-part-1-3ebc2ca35a18" target="_blank" rel="nofollow noopener noreferrer">this tutorial</a> by <a href="https://www.linkedin.com/in/neda-sultova-597a811a8/" target="_blank" rel="nofollow noopener noreferrer">Neda Sultova</a>. Not only is it a great tutorial of DVC in and of itself, but Neda also defines a clear framework for the decision making process at <a href="https://www.helmholtz.ai/" target="_blank" rel="nofollow noopener noreferrer">Helmholtz AI</a>. 
Among the needs are reproducibility, workflow integration, exchangeable backend, framework agnostic, open source, and the ability of the solution to be tweaked to the team's needs.</p> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/geekculture/exploring-dvc-for-machine-learning-pipelines-in-research-part-1-3ebc2ca35a18" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Exploring DVC for Machine Learning Pipelines in Research (Part 1)</h4> <div class="elp-description">The first of a multi-part series on the search and decision making process for MLOps tools at Helmholtz AI.</div> <div class="elp-link">https://medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-06-18/neda-sultova-eaa1b8385248cb7e979bc5bc7c3a3461.png" alt="Exploring DVC for Machine Learning Pipelines in Research (Part 1)"> </div> </a> </section> <p></p> <h2 id="our-philosophy" style="position:relative;">Our Philosophy<a href="#our-philosophy" aria-label="our philosophy permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>And at last I bring you to <a href="https://thenewstack.io/the-road-to-ai-hell-starts-with-good-mlops-intentions/" target="_blank" rel="nofollow noopener noreferrer">"The Road to AI Hell Starts with Good MLOps Intentions" </a> by our CEO <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> which explains our philosophy in the MLOps space. 
You will learn about the experiences that led to developing our tools, what we think is the right way to solve MLOps challenges, and how we do it.</p> <blockquote> <p>Teams made up of data scientists and developers should be able to define their own workflow based on their business requirements and team preferences, just like they do today when constructing any other software artifact. Rather than a platform forcing teams to embrace a highly opinionated workflow, they can employ flexible tools such Git, GitHub, and their existing CI tools as they see fit. - Dmitry Petrov</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://thenewstack.io/the-road-to-ai-hell-starts-with-good-mlops-intentions/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">The Road to AI Hell Starts with Good MLOps Intentions</h4> <div class="elp-description">Dmitry Petrov explains the journey and philosophy at the heart of Iterative.ai's MLOps tools.</div> <div class="elp-link">https://thenewstack.io</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-06-18/ai-hell-12bc4bdc3b703583bb29a50479563837.png" alt="The Road to AI Hell Starts with Good MLOps Intentions"> </div> </a> </section> <p></p> <h1 id="big-news-" style="position:relative;">Big News! 🚀🚀🚀<a href="#big-news-" aria-label="big news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>In case you missed it, June 3rd we introduced our latest tool: DVC Studio! 
It's a web application with a GUI that displays your team's work with DVC and CML. We know this has been on our Community's wishlist and now it's here! You can check out all its features and <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">give it a try here</a>. Or check out the introduction video below.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/hKf4twg832g?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h1 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="r-for-dvc" style="position:relative;">R for DVC!<a href="#r-for-dvc" aria-label="r for dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 
2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Are you or someone on your team an R user? <a href="https://twitter.com/jcpsantiago" target="_blank" rel="nofollow noopener noreferrer">João Santiago</a>, who has contributed to DVC, recently created "dvcru" to provide utility functions for DVC pipelines using R scripts. Additionally, the project aims to show the typical workflows these functions enable and to provide project templates. Check out all the R goodness in <a href="https://github.com/jcpsantiago/dvcru" target="_blank" rel="nofollow noopener noreferrer">this GitHub repository</a>.</p> <p> </p><section class="elp-content-holder"> <a href="https://github.com/jcpsantiago/dvcru" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">dvcru</h4> <div class="elp-description">João Santiago's repository for dvcru, providing utility functions for DVC Pipelines using R scripts.</div> <div class="elp-link">https://github.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-06-18/r-76a084e1e5c947fe6dfecf2312218942.png" alt="dvcru"> </div> </a> </section> <p></p> <h2 id="milecia-mcgregor-at-pydata-socal" style="position:relative;">Milecia McGregor at PyData SoCal<a href="#milecia-mcgregor-at-pydata-socal" aria-label="milecia mcgregor at pydata socal permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Next up we have <a 
href="https://twitter.com/FlippedCoding" target="_blank" rel="nofollow noopener noreferrer">Milecia McGregor</a> presenting and live coding at <a href="https://www.meetup.com/PyData-SoCal/" target="_blank" rel="nofollow noopener noreferrer">PyData SoCal</a> organized by <a href="https://twitter.com/MaverickPramit" target="_blank" rel="nofollow noopener noreferrer">Pramit Choudhary</a>. Check out her talk on "Reproducible ML Experiments (with Git and DVC)" and all the great questions that ensued.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/h0vDuw3s2fE?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="dmitry-petrov-at-mlops-world" style="position:relative;">Dmitry Petrov at MLOps World<a href="#dmitry-petrov-at-mlops-world" aria-label="dmitry petrov at mlops world permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Finally we have <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov's</a> talk at the <a href="https://mlopsworld.com/" target="_blank" rel="nofollow 
noopener noreferrer">MLOps World Conference</a> about machine learning in production entitled "Data Versioning and ML Experiments on Top of Git."</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/Lc0hsT-i7qo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>We're still growing! 
Meet this month's new team members.</p> <h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/jelle-bouwman/" target="_blank" rel="nofollow noopener noreferrer">Jelle Bouwman</a> joins us from Utrecht, Netherlands, as a software engineer. He's worked as a consultant and at an agency. He's most proud of <a href="https://rotterdam.navigate-connections.com/voyages" target="_blank" rel="nofollow noopener noreferrer">the work he did with his team at the Port of Rotterdam</a>. In his free time, Jelle loves reading fiction and books on human psychology/productivity, hiking, and making music with others. He has already shared with the team a <a href="https://open.spotify.com/album/1LqgEMQNmL2yvjsGpihGee?si=7tCaG8-QQ92xvrlVvaUR7A" target="_blank" rel="nofollow noopener noreferrer">great playlist</a> to listen to while trying to focus! Welcome, Jelle! 🎼</p> <p>Next we welcome <a href="https://www.linkedin.com/in/1aguschin/" target="_blank" rel="nofollow noopener noreferrer">Alexander Guschin</a>. Alexander joins us from Russia, where he has been a Data Scientist/ML Engineer for the last five years. He's also participated in many Kaggle competitions and was ranked 5th in general competitions at some point! 
This led him to create a Coursera course on <a href="https://www.coursera.org/learn/competitive-data-science" target="_blank" rel="nofollow noopener noreferrer">how to win data science competitions</a>, covering the tips and tricks needed to win one. Teaching is his passion, and you will probably see him producing some content in the near future. 🧑🏽‍💻</p> <p><a href="https://www.linkedin.com/in/mike0sv/" target="_blank" rel="nofollow noopener noreferrer">Mikhail Sveshnikov</a> also joins us from Russia, where he formerly worked as a Data Engineer Team Lead for Rubbles. He created <a href="https://github.com/zyfra/ebonite" target="_blank" rel="nofollow noopener noreferrer">ebonite</a>, an ML deployment tool, and teaches Python and Big Data at HSE University. Finally, he is one of the admins of the <a href="https://ods.ai/" target="_blank" rel="nofollow noopener noreferrer">ods.ai</a> community, which creates global projects to unite the community, promote Data Science, and help people develop their skills. In his spare time he likes to play guitar and badminton, ski, and mix cocktails. 🍸 Cheers, Mikhail!</p> <p><a href="https://www.linkedin.com/in/jervishui/" target="_blank" rel="nofollow noopener noreferrer">Jervis Hui</a>, from NYC, is joining the go-to-market team at Iterative. He's worked in product marketing at various Silicon Valley tech companies over the years and is excited to bring his experience to the open source world of Iterative. He's passionate about D&I in hiring and looks forward to learning from everyone! We're excited to have Jervis on board! 
🎉</p> <p><img src="https://media.giphy.com/media/Kzo0heGPi6xwjpC5JL/giphy.gif" alt="Hiring GIF"></p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>And yes indeed, we are still hiring! <a href="https://www.notion.so/iterative/iterative-ai-is-hiring-852cb978129645e1906e2c9a878a4d22" target="_blank" rel="nofollow noopener noreferrer">Use this link</a> to find details of all the positions including:</p> <ul> <li>Senior Front-End Engineer (TypeScript, Node, React)</li> <li>Senior Software Engineer (ML, Dev Tools, Python)</li> <li>Senior Software Engineer (ML, Data Infra, GoLang)</li> <li>Machine Learning Engineer/Field Data Scientist</li> <li>Developer Advocate (ML)</li> <li>Director/VP of Engineering (ML, DevTools)</li> <li>Director/VP of Product (ML, Data Infra, SaaS)</li> <li>Director/VP of Operations/Chief of Staff</li> </ul> <p>Please pass this info on to anyone you know that may fit the bill. We look forward to new team members! 
🎉</p> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Don't miss our <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a> June 24th at 3:00 pm UTC (8:00 am PDT), where <a href="https://www.linkedin.com/in/sami-jawhar-a58b9849/" target="_blank" rel="nofollow noopener noreferrer">Sami Jawhar</a> of Kernel will present different experiment use cases. Bring your questions and thinking cap! It's bound to be a great session!</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/278729121/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">dvcru</h4> <div class="elp-description">June DVC Office Hours with Sami Jawhar of Kernel presenting experiment use cases.</div> <div class="elp-link">https://meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-06-18/meetup-a796bdc01514d6fdf2c359ca264f8ef9.png" alt="dvcru"> </div> </a> </section> <p></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 
1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Some people asked about <a href="https://twitter.com/DVCorg">@DVCorg</a> and how I use it so here's am <a href="https://twitter.com/hashtag/Rstats?src=hash&ref_src=twsrc%5Etfw">#Rstats</a> 📦 I'm creatively calling {dvcru} with some utility functions and documentation about how to use the DVC workflow. It will also bootstrap a project with DVC once I push some changes. Check it out!</p>— João Santiago (@jcpsantiago) <a href="https://twitter.com/jcpsantiago/status/1402221732480569349">June 8, 2021</a></blockquote> <hr> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/introducing-dvc-studiohttps://dvc.org/blog/introducing-dvc-studioWed, 02 Jun 2021 00:00:00 GMT<p>We are excited to release DVC Studio - the online UI for DVC and CML.</p> <p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> and <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> have been widely used by ML engineers, data scientists and researchers to simplify their Machine Learning processes. With 8000 GitHub 🌟 and 200+ open source contributors, they have gained popularity as tools that take advantage of the existing engineering toolset that you're already familiar with (Git, CI/CD, etc.) to provide you the best practices for organizing your data and ML projects and collaborating effectively. 
DVC Studio, an extension on top of DVC and CML, adds even more capabilities to your MLOps toolset.</p> <p>DVC Studio is a big new step for our team. Many of you have rightly pointed out the <a href="https://github.com/iterative/dvc/issues/1074" target="_blank" rel="nofollow noopener noreferrer">need for a visual UI</a> for DVC. Your needs, <a href="https://github.com/iterative/dvc/discussions/5941" target="_blank" rel="nofollow noopener noreferrer">ideas and suggestions</a> are our priority. And so, we are thrilled that our new product will make your ML journeys even smoother.</p> <h2 id="how-does-dvc-studio-work" style="position:relative;">How does DVC Studio work?<a href="#how-does-dvc-studio-work" aria-label="how does dvc studio work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>DVC Studio is a web application that you can <a href="https://studio.datachain.ai/" target="_blank" rel="nofollow noopener noreferrer">access online</a> or even host on-prem. 
It works with the data, metrics and hyperparameters that you add to your ML project repositories.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b517c192e0755ae304bc4427d44d5cc8/39600/dvc-studio-view.png" alt="dvc studio view" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Each experiment, represented by a commit in your Git history, is presented along with its data, metrics and hyperparameters. This is your playground for visualizing, comparing and even running experiments.</em></p> <p>DVC Studio relies on the information you save in your Git repository. Connect DVC Studio with GitHub, GitLab or Bitbucket to read repositories and to run new experiments (using regular CI/CD capabilities - we'll talk about this in a moment).</p> <p>DVC Studio analyzes Git history and extracts information about your ML experiments - datasets being used, metrics and hyperparameters. By using DVC, you can be sure not to bloat your repositories with large volumes of data or huge models. These large assets reside in cloud or other remote storage locations (and we don't require you to give us access to them!).</p> <h2 id="visualize-collaborate-track" style="position:relative;">Visualize. Collaborate. 
Track.<a href="#visualize-collaborate-track" aria-label="visualize collaborate track permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This video shows you how you can visualize your experiments using DVC Studio.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/hKf4twg832g?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>DVC, along with Git, performs your ML bookkeeping automatically. Using a simple UI, you can import your experiment history from Git. You can get quick access to important metrics across multiple projects, or dive deep and explore individual experiments. 
You can visualize and compare models the way that best fits your needs, whether it is through precision-recall curves, score comparisons, or trend charts showing how your model is evolving over time.</p> <p>This makes it easy to see exactly how your model’s performance changed when you increased the number of layers in your neural net, added some more samples to your training dataset, or increased the number of training epochs.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/eab5b427468a7c6ddc2ce8f487243048/39600/trends-chart.png" alt="trends chart" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>With DVC Studio, you can visualize your model evolution. This Trends chart, for instance, shows how the average precision increased over the course of your experiments.</em></p> <p>You will get the dashboard and all the visuals automatically if your metrics and plots are stored in Git through DVC. But if you do not use DVC, you can still add custom files with your metrics and parameters and DVC Studio will efficiently generate tables and plots for your custom input.</p> <p>DVC Studio also provides a visual UI to create and manage teams, manage roles, and share your experiment tables, enabling easy and efficient knowledge sharing and collaboration.</p> <h2 id="use-git-for-ml-metrics-tracking-nothing-fancy" style="position:relative;">Use Git for ML metrics tracking. 
Nothing fancy.<a href="#use-git-for-ml-metrics-tracking-nothing-fancy" aria-label="use git for ml metrics tracking nothing fancy permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Most ML engineers already use Git for code versioning. <a href="https://dvc.org/doc/command-reference/init"><code>dvc init</code></a>, <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> - these simple Git-like DVC commands are all you need to convert your Git repos into DVC repos - a single source of truth for not just your code but also your data, model and metrics.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/5xM5az78Lrg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>What makes DVC Studio special is this connection to the Git ecosystem. 
The table and visuals in DVC Studio aren’t magic - they are simply a representation of the data in JSON or CSV files in your Git repositories.</p> <h2 id="automate-your-ml-process-no-code" style="position:relative;">Automate your ML process. No-code.<a href="#automate-your-ml-process-no-code" aria-label="automate your ml process no code permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Mature ML teams reuse their code over and over again while tuning data and hyperparameters. DVC Studio automates this in the visual user interface. To run an experiment on DVC Studio, use its UI to modify the ML model hyperparameters and dataset version. The modifications and the message you enter will be automatically converted to a proper Git commit. 
Your team members can see the changes through your Git platform or DVC Studio and track the author and timestamp of the change.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/nXJXR-zBvHQ?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>If your project is integrated with a CI/CD process, model training is triggered automatically. Once the experiment completes, all its inputs and outputs are available in DVC Studio, ready for visualizing and comparing. Making changes through the UI helps your team iterate faster and avoid the mistakes that come with manual code changes.</p> <p><a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a> can create reports and orchestrate resources in your cloud (GCP, AWS or Azure) or Kubernetes to run training. Because this approach is cloud-agnostic, you are not tied to a particular cloud provider, which helps you avoid vendor lock-in.</p> <p>With this approach, managers and DevOps folks who are not experts in creating ML models can also be part of the ML model training process. 
They can re-train your model on a new version of the dataset or try other changes to your model.</p> <h2 id="create-magic" style="position:relative;">Create magic!<a href="#create-magic" aria-label="create magic permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>So, don’t reinvent the wheel. Use Git. Through a simple UI. Use your existing CI/CD setup. Use your existing cloud. Get the most out of them. And create magic :) Okay, the tables and visuals in DVC Studio aren’t magic, but they sure are magical. Right?</p> <h2 id="get-started-now" style="position:relative;">Get started now<a href="#get-started-now" aria-label="get started now permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Get started at <a href="https://studio.datachain.ai" target="_blank" rel="nofollow noopener noreferrer">https://studio.datachain.ai</a>. Simply connect with your GitHub, GitLab or Bitbucket account. 
No additional sign-ups are required.</p> <p>For more information on how to use DVC Studio, please check out the <a href="https://dvc.org/doc/studio" target="_blank" rel="nofollow noopener noreferrer">docs</a>.</p> <p>DVC Studio is completely free for individuals and small teams. Let us know if you would like to set up DVC Studio for <a href="https://form.typeform.com/to/nydf3Oys?typeform-medium=embed-snippet" target="_blank" rel="nofollow noopener noreferrer">5+ member teams</a> or for <a href="https://form.typeform.com/to/bd9lTEt9?typeform-medium=embed-snippet" target="_blank" rel="nofollow noopener noreferrer">enterprises</a>, and we will get back to you soon.</p> <p>We would love to get your feedback. Reach out to us with your questions, concerns or requests on <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>. Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas. You can also raise an issue on <a href="https://github.com/iterative/studio-support" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p> <p>We are super excited to have you use DVC Studio. We’re confident that it’ll make your Machine Learning journeys so much easier. We can’t wait to hear how it goes.</p>https://dvc.org/blog/may-21-community-gemshttps://dvc.org/blog/may-21-community-gemsFri, 28 May 2021 00:00:00 GMT<p>Each month we go through our Discord messages to pull out some of the best questions from our community. AKA: Community Gems. 
💎 This month we'd like to thank @asraniel, @PythonF, @mattlbeck, @Ahti, @yikeqicn, @lexzen, @EdAb, @FreshLettuce for inspiring this month's gems!</p> <p>As always, <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">join us in Discord</a> to get all your DVC and CML questions answered!</p> <h2 id="dvc" style="position:relative;">DVC<a href="#dvc" aria-label="dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="what-is-the-best-way-to-commit-2-experiment-runs" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/836626346544594995" target="_blank" rel="nofollow noopener noreferrer">What is the best way to commit 2 experiment runs?</a><a href="#what-is-the-best-way-to-commit-2-experiment-runs" aria-label="what is the best way to commit 2 experiment runs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You want to use <a href="https://dvc.org/doc/command-reference/exp/branch"><code>dvc exp branch</code></a> if you want to keep multiple experiments. 
That way, each one lives in its own branch and you don't have to apply one experiment on top of another.</p> <h3 id="how-can-i-clean-up-the-remote-caches-after-a-lot-of-experiments-and-branches-have-been-pushed" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/831142466169733120" target="_blank" rel="nofollow noopener noreferrer">How can I clean up the remote caches after a lot of experiments and branches have been pushed?</a><a href="#how-can-i-clean-up-the-remote-caches-after-a-lot-of-experiments-and-branches-have-been-pushed" aria-label="how can i clean up the remote caches after a lot of experiments and branches have been pushed permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://dvc.org/doc/command-reference/exp"><code>dvc exp gc</code></a> requires at least one flag to operate, at the very least <code>--workspace</code>. With <code>--workspace</code>, <code>dvc</code> reads all of the pointer files in the workspace: <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files and <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files. From these it determines which cache objects/files need to be preserved (since they are being used in the current workspace). The rest of the files in <code>.dvc/cache</code> are removed.</p> <p><em>This does not require any Git operations!</em></p> <p>You can also use the <code>--all-branches</code> flag. 
It reads the pointer files present in the current workspace and in the commits of the branches you have locally, then uses that list to determine what to keep and what to remove.</p> <p>If you need to read pointer files from given tags you have locally, the <code>--all-tags</code> flag is the best option.</p> <p>The <code>--all-commits</code> flag reads pointer files from every commit. It lists all the files in the cache/remote and deletes any whose <em>.dvc</em> file isn't found in any commit of the Git repo.</p> <h3 id="if-i-have-two-cloud-folder-links-added-to-the-dvc-config-im-able-to-push-the-data-to-the-default-one-how-could-i-push-the-data-to-the-other-cloud-folder" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/833176227762274364" target="_blank" rel="nofollow noopener noreferrer">If I have two cloud folder links added to the DVC config, I'm able to push the data to the default one. 
How could I push the data to the other cloud folder?</a><a href="#if-i-have-two-cloud-folder-links-added-to-the-dvc-config-im-able-to-push-the-data-to-the-default-one-how-could-i-push-the-data-to-the-other-cloud-folder" aria-label="if i have two cloud folder links added to the dvc config im able to push the data to the default one how could i push the data to the other cloud folder permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You're looking for the <code>-r / --remote</code> option for <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>. The command looks like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token parameter variable">--remote</span> <span class="token operator"><</span>name_of_remote_storage<span class="token operator">></span></span></code></pre></div> <p>It will push directly to the remote storage you defined in the command above.</p> <h3 id="whats-the-current-recommended-way-to-automate-hyperparameter-search-when-using-dvc-pipelines" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/829803720190590986" target="_blank" rel="nofollow noopener noreferrer">What's the current recommended way to automate hyperparameter search when using DVC pipelines?</a><a href="#whats-the-current-recommended-way-to-automate-hyperparameter-search-when-using-dvc-pipelines" aria-label="whats the 
current recommended way to automate hyperparameter search when using dvc pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Take a look at the new <a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">experiments feature</a>! It enables you to easily experiment with different parameter values.</p> <p>You could script a grid search pretty easily by queueing an experiment for each set of parameter values you want to try. For example:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--queue</span> <span class="token parameter variable">-S</span> <span class="token assign-left variable">alpha</span><span class="token operator">=</span><span class="token punctuation">{</span>alpha<span class="token punctuation">}</span>,beta<span class="token operator">=</span><span class="token punctuation">{</span>beta<span class="token punctuation">}</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--run-all</span> <span class="token parameter variable">--jobs</span> <span class="token number">2</span></span></code></pre></div> <p>The <code>--jobs 2</code> flag means you're running 2 queued experiments in parallel. 
By default, the <code>--run-all</code> flag runs 1 queued experiment at a time.</p> <p>Then you can compare the results with <a href="https://dvc.org/doc/command-reference/exp/show"><code>dvc exp show</code></a>.</p> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>avg_prec<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>roc_auc<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>train.n_est<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>train.min_split<span class="token hide">**</span></span> </span> ─────────────────────────────────────────────────────────────────── <span class="token rows"> workspace 0.56191 0.93345 50 2 master 0.55259 0.91536 50 2 ├── exp-bfe64 0.57833 0.95555 50 8 ├── exp-b8082 0.59806 0.95287 50 64 ├── exp-c7250 0.58876 0.94524 100 2 ├── exp-b9cd4 0.57953 0.95732 100 8 ├── exp-98a96 0.60405 0.9608 100 64 └── exp-ad5b1 0.56191 0.93345 50 2 </span> ───────────────────────────────────────────────────────────────────</code></pre></div> <p>We are working on developing experiments to have features or documented patterns explicitly for grid search support, so definitely <a href="https://github.com/iterative/dvc/issues/4283" target="_blank" rel="nofollow noopener noreferrer">share any feedback</a> to help drive the future direction of that!</p> <h3 id="when-importinggetting-data-from-a-repo-how-do-i-provide-credentials-to-the-source-repo-remote-storage-without-saving-it-into-that-git-repo" style="position:relative;"><a 
href="https://discord.com/channels/485586884165107732/563406153334128681/830021022337073185" target="_blank" rel="nofollow noopener noreferrer">When importing/getting data from a repo, how do I provide credentials to the source repo remote storage without saving it into that Git repo?</a><a href="#when-importinggetting-data-from-a-repo-how-do-i-provide-credentials-to-the-source-repo-remote-storage-without-saving-it-into-that-git-repo" aria-label="when importinggetting data from a repo how do i provide credentials to the source repo remote storage without saving it into that git repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There's a bit of context behind this question that might give it more meaning. Here's the background information given by @EdAb in Discord:</p> <hr> <p>I set up a private GitHub repo to be a data registry and I have set up a private Azure remote where I have pushed some datasets.</p> <p>I am now trying to read those datasets from another repository ("my-project-repo"), using <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> (e.g. 
<a href="https://dvc.org/doc/command-reference/get#-registry-repo"><code>dvc get [email protected]:data-registry-repo.git path/data.csv</code></a>) but I get this error:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">ERROR: failed to get <span class="token string">'path/data.csv'</span> from <span class="token string">'[email protected]:data-registry-repo.git'</span> - Authentication to Azure Blob Storage via default credentials <span class="token punctuation">(</span>https://azuresdkdocs.blob.core.windows.net/<span class="token variable">$web</span>/python/azure-identity/1.4.0/azure.identity.html<span class="token comment">#azure.identity.DefaultAzureCredential) failed.</span> Learn <span class="token function">more</span> about configuration settings at <span class="token operator"><</span>https://man.dvc.org/remote/modify<span class="token operator">></span>: unable to connect to account <span class="token keyword">for</span> Must provide either a connection_string or account_name with credentials<span class="token operator">!</span><span class="token operator">!</span></code></pre></div> <hr> <p>Generally, there are two ways to solve this issue:</p> <ul> <li><a href="https://dvc.org/doc/command-reference/remote/modify" target="_blank" rel="nofollow noopener noreferrer">Environment variables</a></li> <li>Set up some options using the <code>--global</code> or <code>--system</code> flags to update the DVC config</li> </ul> <p>If you're going to update the DVC config to include your cloud credentials, use the <a href="https://dvc.org/doc/command-reference/remote/modify"><code>dvc remote modify</code></a> command. 
Here's an example of how you can do that with Azure using the <code>--global</code> flag.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> <span class="token parameter variable">--global</span> myremote connection_string <span class="token string">'mysecret'</span></span></code></pre></div> <p>You should initialize <code>myremote</code> in the config file with <a href="https://dvc.org/doc/command-reference/remote/add"><code>dvc remote add</code></a> and remove the URL to rely on the one that comes from the repo being imported.</p> <p>This will modify the global config file, instead of the <em>.dvc/config</em> file. You could also use the <code>--system</code> flag to modify the system file if that's necessary for your project. You can take a look at the specific <a href="https://dvc.org/doc/command-reference/config" target="_blank" rel="nofollow noopener noreferrer">config file locations here</a>.</p> <h3 id="is-there-any-way-to-ensure-that-dvc-import-uses-the-cache-from-the-config-file-and-how-can-i-keep-the-cache-consistent-for-multiple-team-members" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/827574712825413672" target="_blank" rel="nofollow noopener noreferrer">Is there any way to ensure that <code>dvc import</code> uses the cache from the config file and how can I keep the cache consistent for multiple team members?</a><a href="#is-there-any-way-to-ensure-that-dvc-import-uses-the-cache-from-the-config-file-and-how-can-i-keep-the-cache-consistent-for-multiple-team-members" aria-label="is there any way to ensure that dvc import uses the cache from the config file and how can i keep the cache consistent for multiple team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 
9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is another great question where a little context might be useful.</p> <hr> <p>I'm trying to import a dataset project called <em>dvcdata</em> into another DVC project.</p> <p>The config for <em>dvcdata</em> is:</p> <div class="gatsby-highlight" data-language="ini"><pre class="language-ini"><code class="language-ini"><span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">core</span><span class="token punctuation">]</span></span> <span class="token key attr-name">remote</span> <span class="token punctuation">=</span> <span class="token value attr-value">awsremote</span> <span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">cache</span><span class="token punctuation">]</span></span> <span class="token key attr-name">type</span> <span class="token punctuation">=</span> <span class="token value attr-value">symlink</span> <span class="token key attr-name">dir</span> <span class="token punctuation">=</span> <span class="token value attr-value">/home/user/dvc_cache</span> <span class="token section"><span class="token punctuation">[</span><span class="token section-name selector">'remote "awsremote"'</span><span class="token punctuation">]</span></span> <span class="token key attr-name">url</span> <span class="token punctuation">=</span> <span class="token value attr-value">s3://...</span></code></pre></div> <p>When I run <a href="https://dvc.org/doc/command-reference/import"><code>dvc import [email protected]:user/dvcdata.git my_data</code></a>, it starts to download it. 
I have double checked that I have pushed this config file to master and don't understand why it's not pulling the data from my cache instead of downloading the data again.</p> <hr> <p>The repo you are importing into has its own cache directory. If you want to use the same cache directory across both projects, you have to configure <em>cache.dir</em> in both projects. You also have the option to configure the <em>cache.type</em>.</p> <p>You can set up the cache dir and cache link type in your own global config. Then, when project 1 imports <code>dvcdata</code>, it will be cached there. When project 2 later imports <code>dvcdata</code>, it will just be linked or copied from the cache, depending on the config, without downloading.</p> <p>We recommend you use the <code>--global</code> or <code>--system</code> flags in the <a href="https://dvc.org/doc/command-reference/config"><code>dvc config</code></a> command for updating the configs globally. An example of this would be:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc config</span> <span class="token parameter variable">--global</span> cache.dir path/to/cache/</span></code></pre></div> <p>If you set up a cache that is not shared and is located on a separate volume, and you have a lot of data, consider also enabling symlinks as described in <a href="https://dvc.org/doc/user-guide/large-dataset-optimization#large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">Large Data Optimizations</a>.</p> <p>You might also consider using the local URL of the source project to avoid the import downloading from the remote storage. 
That would look something like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> /home/user/dvcdata my_data</span></code></pre></div> <p>If your concern is keeping these configs consistent for multiple users on the same machine, check out <a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">the doc on shared server development</a> to get more details!</p> <h2 id="cml" style="position:relative;">CML<a href="#cml" aria-label="cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://discord.com/channels/485586884165107732/728693131557732403/827099289372983336" target="_blank" rel="nofollow noopener noreferrer">I have an ML model that retrains every 24 hours with updated data, but I do not want to create a merge request every time. I just need a nice way to look at the results. Is there a solution on how to report the results of a pipeline in Gitlab?</a></p> <p>Great question! 
CML doesn't currently have a feature that takes care of this, but here are a couple of solutions (only one is needed):</p> <ol> <li>Keep a separate branch with unrelated history for committing the reports.</li> <li>Keep a single report file on the repository and update it with each commit.</li> </ol> <p><a href="https://discord.com/channels/485586884165107732/728693131557732403/818450988084101160" target="_blank" rel="nofollow noopener noreferrer">I've run into an error trying to get CML to orchestrate runs in my AWS account. It doesn't seem to be a permissions issue as the <code>AWSEc2FullAccess</code> policy seems to have worked, but I can't see the security group. What could be going wrong?</a></p> <p>Check to make sure you are deploying to the correct region. Use the argument <code>--cloud-region <region></code> (<code>us-west</code> for example) to mark the region where the instance is deployed.</p> <p><a href="%5Bhttps://discord.com/channels/485586884165107732/728693131557732403/818450988084101160">Head to these docs</a> for more information on the optional arguments that the CML runner accepts.</p> <p>Until next month…</p> <p><img src="https://media.giphy.com/media/XcAa52ejGuNqdb5SFQ/giphy.gif" alt="You Got This Hedgehog GIF by MOODMAN"></p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered and contribute to the MLOps community! 
🚀</p>https://dvc.org/blog/may-21-dvc-heartbeathttps://dvc.org/blog/may-21-dvc-heartbeatFri, 21 May 2021 00:00:00 GMT<h1 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>It's been another month full of community goodness and we are grateful! Let's get right to it!</p> <p><img src="https://media.giphy.com/media/jmqWAjoxFCxJNHD2Kz/giphy.gif" alt="Thank you"></p> <h3 id="curvenote-with-dvc-tutorials" style="position:relative;">Curvenote with DVC tutorials<a href="#curvenote-with-dvc-tutorials" aria-label="curvenote with dvc tutorials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Interested in versioning your data AND your notebooks?<br> <a href="https://twitter.com/stevejpurves" target="_blank" rel="nofollow noopener noreferrer">Steve Purves</a>, CTO and co-founder of <a href="https://curvenote.com/" target="_blank" rel="nofollow noopener noreferrer">Curvenote</a>, gave a three-part tutorial on integrating DVC and Curvenote for creating reproducible, collaborative version control for data scientists. 
The videos are beginner accessible with tips for intermediate git users. <a href="https://www.youtube.com/watch?v=OnNVbIEIO7A" target="_blank" rel="nofollow noopener noreferrer">Access the videos here.</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3a00f0b45348e1b6f411aed445cf2c8e/03346/curvenote-dvc-integration.jpg" alt="curvenote dvc integration" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>DVC and Curvenote for the version control win!</em></p> <h3 id="cml-with-jenkins-in-dagshub" style="position:relative;">CML with Jenkins in DAGsHub<a href="#cml-with-jenkins-in-dagshub" aria-label="cml with jenkins in dagshub permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Next up, <a href="https://www.linkedin.com/in/puneeth-pai-b3b299a1/" target="_blank" rel="nofollow noopener noreferrer">Puneeth Pai</a> of <a href="https://www.thoughtworks.com/" target="_blank" rel="nofollow noopener noreferrer">Thoughtworks</a> wrote a two-part blog series with a how-to for achieving continuous machine learning using DVC pipelines with Jenkins and DAGsHub. 
Quoted in the article is our own <a href="https://github.com/DavidGOrtega" target="_blank" rel="nofollow noopener noreferrer">David Ortega</a>,</p> <blockquote> <p>Treating experiments like potential new features in a software project opens up many possibilities for improving our engineering practices.</p> </blockquote> <p>Check out these posts at the link below or catch Puneeth at our next <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/278163666/" target="_blank" rel="nofollow noopener noreferrer">Meetup</a> where he will be giving a high level overview of this content as well as answering questions.</p> <p> </p><section class="elp-content-holder"> <a href="https://dagshub.com/blog/in-depth-tour-of-jenkinsfile/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">CML with Jenkins in DAGsHub</h4> <div class="elp-description">The first of a two-part series on how to set up continuous machine learning using DVC pipelines with Jenkins and DAGsHub.</div> <div class="elp-link">https://dagshub.com/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-05-21/puneeth-gears-a85728868a27748e24f94f8e73f46032.png" alt="CML with Jenkins in DAGsHub"> </div> </a> </section> <p></p> <h3 id="discord-server-explosion" style="position:relative;">Discord Server Explosion<a href="#discord-server-explosion" aria-label="discord server explosion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our <a href="https://discord.com/invite/dvwXA2N" 
target="_blank" rel="nofollow noopener noreferrer">Discord server</a> has exploded since last month, up 30% in membership 😱, thanks in large part to a <a href="https://towardsdatascience.com/" target="_blank" rel="nofollow noopener noreferrer"><strong>Towards Data Science</strong></a> post from <a href="https://www.linkedin.com/in/sara-a-metwalli/" target="_blank" rel="nofollow noopener noreferrer">Sara Metwalli</a> recommending <a href="https://towardsdatascience.com/9-discord-servers-for-math-python-and-data-science-you-need-to-join-today-34214b93d6b8" target="_blank" rel="nofollow noopener noreferrer"><strong>9 Discord Servers for Math, Python, and Data Science You Need to Join Today.</strong></a></p> <p>Sara encourages readers to connect, learn and get inspired. 🚀 Thanks Sara! We're on board with that! Rest assured our growing team is hard at work creating content, improving tools and working on new tools 😶🤗 to continue to grow and serve our MLOps community!</p> <h1 id="in-other-mlops-news-" style="position:relative;">In Other MLOps News …<a href="#in-other-mlops-news-" aria-label="in other mlops news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <h2 id="learning-opportunities" style="position:relative;">Learning Opportunities<a href="#learning-opportunities" aria-label="learning opportunities permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 
2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/AndrewYNg" target="_blank" rel="nofollow noopener noreferrer">Andrew NG</a> of <a href="https://twitter.com/DeepLearningAI_" target="_blank" rel="nofollow noopener noreferrer">Deep Learning AI</a> and <a href="https://www.coursera.org/" target="_blank" rel="nofollow noopener noreferrer">Coursera</a> fame has just released a new course specializing in MLOps, called <a href="https://www.coursera.org/specializations/machine-learning-engineering-for-production-mlops?utm_campaign=20210423-mlep-1-program-email-mlep-launch&utm_medium=institutions&_hsmi=126760441&_hsenc=p2ANqtz-9wSUanrnpyWNavtaCEzBLVpDXwatEig_ahaksJQhZO6dKkLRykfOxRwkpAZiipxWej4xs1uQgrXl-JCgB0M-Ha_vCUvEqaswIVZQhNd-jUDsE8SJs&utm_source=deeplearning-ai" target="_blank" rel="nofollow noopener noreferrer">Machine Learning Engineering for Production (MLOps) Specialization</a>. The course "combines the foundational concepts of machine learning with the functional expertise of modern software development and engineering roles." Methodologies and capabilities of MLOps are introduced while addressing the challenges and consequences of machine learning engineering in production. I'm signed up! 
🙋🏻‍♀️ How 'bout you?</p> <p> </p><section class="elp-content-holder"> <a href="https://www.coursera.org/specializations/machine-learning-engineering-for-production-mlops?utm_campaign=20210423-mlep-1-program-email-mlep-launch&utm_medium=institutions&_hsmi=126760441&_hsenc=p2ANqtz-9wSUanrnpyWNavtaCEzBLVpDXwatEig_ahaksJQhZO6dKkLRykfOxRwkpAZiipxWej4xs1uQgrXl-JCgB0M-Ha_vCUvEqaswIVZQhNd-jUDsE8SJs&utm_source=deeplearning-ai" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Machine Learning Engineering for Production (MLOps) Specialization</h4> <div class="elp-description">Andrew Ng's new course in Coursera providing the foundation to successful and efficient MLOps</div> <div class="elp-link">https://www.coursera.org/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-05-21/andrew-ng-626e2a8303ad876772f9bb809c95cf54.png" alt="Machine Learning Engineering for Production (MLOps) Specialization"> </div> </a> </section> <p></p> <p>Next for your learning pleasure, <a href="https://twitter.com/s_scardapane" target="_blank" rel="nofollow noopener noreferrer">Simone Scardapane</a> is in the process of fulfilling a "personal challenge" to create a PhD course for <a href="https://twitter.com/s_scardapane/status/1389240445788643329?s=20" target="_blank" rel="nofollow noopener noreferrer"><strong>Reproducible Deep Learning</strong></a> that includes the use of open source tools including our own DVC! <a href="https://github.com/sscardapane/reprodl2021" target="_blank" rel="nofollow noopener noreferrer">Head to the link</a> to star the repo and cheer him on. We will be! 
🙌🏼</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 603px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ab974d16e4e3484ec253af5b5feba427/39600/reproducedl.png" alt="reproducedl" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Reproducible Deep Learning PhD Course</em></p> <p><a href="https://twitter.com/s_scardapane" target="_blank" rel="nofollow noopener noreferrer">Simone Scardapane</a> is in the process of fulfilling a "personal challenge" to create a PhD course for <a href="https://twitter.com/s_scardapane/status/1389240445788643329?s=20" target="_blank" rel="nofollow noopener noreferrer"><strong>Reproducible Deep Learning</strong></a> that includes the use of open source tools including our own DVC! <a href="https://github.com/sscardapane/reprodl2021" target="_blank" rel="nofollow noopener noreferrer">Head to the link</a> to star the repo and cheer him on. We will be! 🙌🏼</p> <p>You see what I did there, right? <strong>Reproducible</strong>… <strong>Deep Learning</strong>…<br> Get it? Layers of wit people. I learned from the best! Just wanted to make sure you were paying attention!</p> <p><img src="https://media.giphy.com/media/6ra84Uso2hoir3YCgb/giphy.gif" alt="Marvel Studios Smile GIF by Disney+"></p> <h1 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>We've hit 30 team members! 
Our team is distributed all over the world and has grown so much that we now have two all-hands meetings! Affectionately called UTC + and UTC -, these meetings honor all our different time zones while allowing the other group to watch via recording when they are awake! You know we're all about solving complicated problems. 💪🏼</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4773d87f73f3147561c9517e44c3ce7a/39600/team-map.png" alt="team map" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Our team</em></p> <h2 id="new-team-members" style="position:relative;">New Team Members<a href="#new-team-members" aria-label="new team members permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/svetlana-sachkovskaya/" target="_blank" rel="nofollow noopener noreferrer">Svetlana Sachkovskaya</a> is originally from Belarus, but is currently living in Poland. She has been a full stack developer for over seven years. She loves traveling, meeting new people and is excited to work on open source software. In her spare time you may find her dancing the tango! 💃🏻 Welcome Sveta!</p> <p>Exemplifying our diverse team in one fell swoop, <a href="https://cdcl.ml" target="_blank" rel="nofollow noopener noreferrer">Casper da Costa-Luis</a> has lived in three continents. 
He has been working on DVC for a couple of years and is a long-standing contributor to open source. He now joins us on the CML & Docs teams after completing his PhD in Medical Imaging. Fun facts about Casper include his becoming the U18 chess champion of Kenya when he was 14 and being a qualified SCUBA diver. 🤿</p> <p><a href="https://github.com/iesahin" target="_blank" rel="nofollow noopener noreferrer">Emre Şahin</a> joins us on the DVC team as a technical writer/ML enthusiast/AI dreamer/tutorial builder from Istanbul, Turkey. A self-described zealot for technologies, Emre has worked on many development/ML-related projects and has been programming in Python since v. 1.7. We are excited for Emre to bring you excellent technical content! ✍🏼</p> <p><a href="https://www.linkedin.com/in/tapa-dipti-sitaula/" target="_blank" rel="nofollow noopener noreferrer">Tapa Dipti Sitaula</a> joins us as a Senior Product Engineer from Nepal. She previously worked as a Principal Engineer at a tech startup in India and has worked in various capacities in her career, from engineering to project management and communications. Her interests include learning languages and breaking gender stereotypes. We're right there with you, Tapa! 
🚀</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>And we're still hiring!</p> <p><a href="https://weworkremotely.com/company/iterative" target="_blank" rel="nofollow noopener noreferrer"><strong>Check out our three open roles</strong></a> for:</p> <ul> <li><a href="https://weworkremotely.com/remote-jobs/iterative-senior-front-end-engineer" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Frontend Engineer</strong></a></li> <li><a href="https://weworkremotely.com/remote-jobs/iterative-senior-software-engineer-open-source-dev-tools-3" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Software Engineer - Open Source, Dev Tools</strong></a> and</li> <li><a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong>.</a></li> </ul> <p>Does this sound like you or someone you know? 
Be in touch!</p> <h2 id="dvcteam-conference-talks" style="position:relative;">DVCTeam Conference Talks<a href="#dvcteam-conference-talks" aria-label="dvcteam conference talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://mlrepa.com/" target="_blank" rel="nofollow noopener noreferrer">ML Repa Week</a> took place last month and team members gave three great talks. <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> gave a talk on data versioning and machine learning experiments on top of Git. <a href="https://www.linkedin.com/in/drelleobrien/" target="_blank" rel="nofollow noopener noreferrer">Elle O'Brien</a> gave a talk on automating machine learning with GitHub Actions and GitLab CI. And finally, <a href="https://www.linkedin.com/in/mnrozhkov/" target="_blank" rel="nofollow noopener noreferrer">Mikhail Rozhkov</a> gave a talk on setting up the workflow for machine learning batch scoring applications using DVC, MLflow and Airflow. 
Be sure to check out all three talks and other great talks from the week long Conference.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.youtube.com/watch?v=OD2KiIOMeMw&list=PLlxErbAvYYLDRP6cHtVP76f2g5Yoh6c5R&index=2" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVC: Data Versioning and ML Experiments on Top of Git</h4> <div class="elp-description">Dmitry Petrov's talk at ML Repa Week on using DVC as an extension of Git for data versioning and machine learning experiments</div> <div class="elp-link">http://ml-repa.ru/en/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-05-21/dmitry-ml-repa-week-fb9ce5758f68b866a4999eefbd7862ad.png" alt="DVC: Data Versioning and ML Experiments on Top of Git"> </div> </a> </section> <p></p> <p> </p><section class="elp-content-holder"> <a href="https://youtu.be/tOo98CtiDJg" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Automating Machine Learning with GitHub Actions & GitLab CI</h4> <div class="elp-description">Elle O'Brien's conference talk about how to use GitHub actions or GitLab CI to provide automation for your machine learning projects</div> <div class="elp-link">http://ml-repa.ru/en</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-05-21/elle-ml-repa-week-a7b3d9a03df85303a074882054faec32.png" alt="Automating Machine Learning with GitHub Actions & GitLab CI"> </div> </a> </section> <p></p> <p> </p><section class="elp-content-holder"> <a href="https://youtu.be/PYzvLc7o7u0" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Workflow & MLOps for Batch Scoring Applications with DVC, MLflow and Airflow</h4> <div class="elp-description">Mikhail Rozhkov's talk on how to set up a workflow for batch scoring 
applications integrating DVC, MLflow and Airflow</div> <div class="elp-link">http://ml-repa.ru/en</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-05-21/mikhail-ml-repa-week-66dee03e41e7efb2033190ee966e4bc4.png" alt="Workflow & MLOps for Batch Scoring Applications with DVC, MLflow and Airflow"> </div> </a> </section> <p></p> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Don't miss our <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/277245660" target="_blank" rel="nofollow noopener noreferrer">Meetup</a> May 27th at 3:00pm UTC, where we will hear from Puneeth Pai as mentioned above 👆🏽, as well as another user putting DVC and CML into action on his team, and finally from David Ortega discussing CML pull requests! Bring your questions! 
We're here to help!</p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🦉 I'm really enjoying reading through <a href="https://twitter.com/DVCorg">@DVCorg</a>'s documentation and use cases for operationalizing machine learning models.<a href="https://t.co/9k8tSfXbMK">https://t.co/9k8tSfXbMK</a><br><br>If you've ever tried to put a model in production, these concepts will resonate. Check out their open-source project on <a href="https://twitter.com/github">@Github</a>! ✨ <a href="https://t.co/zsSdlivwZk">pic.twitter.com/zsSdlivwZk</a></p>— 👩‍💻 Paige Bailey (@DynamicWebPaige) <a href="https://twitter.com/DynamicWebPaige/status/1394389238750326787">May 17, 2021</a></blockquote> <p>That's quite a shout out! Thanks to <a href="https://twitter.com/JorgeOrpinel" target="_blank" rel="nofollow noopener noreferrer">Jorge Orpinel</a> and team for always raising the bar on our docs! Until next month! 👩🏽‍💻</p> <hr> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/experiment-refshttps://dvc.org/blog/experiment-refsMon, 19 Apr 2021 00:00:00 GMT<p>One of the main features provided by DVC is the ability to version machine learning (ML) pipelines and experiments using Git commits. While this works very well for versioning mature projects and models, for projects under active development that may require generating hundreds of experiments or more in a single day, typical Git workflows can be difficult to work with. This type of rapid experimentation may appear to fit nicely with the concept of Git feature branches, but a Git repository with such large numbers of branches will eventually become too unwieldy to manage.</p> <p>In DVC 2.0, we’ve introduced a new feature set aimed at simplifying the versioning of lightweight ML experiments. DVC now provides a series of <a href="https://dvc.org/doc/command-reference/exp"><code>dvc exp</code></a> commands which allow you to easily generate new experiments with modified hyperparameters, and to quickly compare their results. In this post, we’ll show how DVC leverages the power of Git references to track each experiment, while also completely abstracting away the need for you to manually manage a potentially unlimited number of Git feature branches or tags.</p> <p><em>Note: This post mainly focuses on the “How?” side of DVC 2.0 experiments. 
For a great overview of the “What?” check out our <a href="https://dvc.org/blog/dvc-2-0-release">2.0 release post</a> and our <a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">Get Started: Experiments</a> guide.</em></p> <h2 id="experiments-in-dvc-20" style="position:relative;">Experiments in DVC 2.0<a href="#experiments-in-dvc-20" aria-label="experiments in dvc 20 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>At the heart of the new experiments feature is the <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> command. Whenever a pipeline is executed with <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a>, the results will be automatically tracked by DVC as a single “experiment”. DVC will track everything in your workspace as a part of the experiment, including unstaged changes made prior to execution. 
This means that DVC experiments can be used to test the result of changes to DVC-tracked data or pipeline parameters, as well as changes to Git-tracked code.</p> <p><img src="https://dvc.org/2021-04-19/exp-run-0e62e88195f222135b89806a7e74915d.gif" alt="Example experiment run" title="Example experiment run"></p> <p><em>Note: You can follow along with the commands used in this example and throughout this post, using our <a href="https://github.com/iterative/example-get-started" target="_blank" rel="nofollow noopener noreferrer">example-get-started</a> repository.</em></p> <p>Now let’s take a deeper look into what actually happened when we ran our experiment. Starting from the latest commit in our repository’s <code>master</code> branch, we invoked <a href="https://dvc.org/doc/command-reference/exp/run#--set-param"><code>dvc exp run --set-param</code></a> to generate a new experiment with the specified parameter value. DVC then reproduced our pipeline as if we had manually edited our <code>params.yaml</code> to contain that parameter change (setting <code>featurize.max_features</code> to <code>2000</code>), and then saved the results in a new experiment named <code>exp-26220</code>.</p> <p>Returning DVC users will likely be familiar with the typical Git+DVC workflow of reproducing your pipeline, staging the results in Git, and then Git committing those changes:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> </span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span> </span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span></span></code></pre></div> <p>This workflow is now essentially automated within our single <code>exp run</code> command, with one key difference. 
Rather than saving the results in a Git <em>branch</em>, the results are saved in a custom Git <em>reference</em>.</p> <h2 id="what-is-a-git-reference" style="position:relative;">What is a Git reference?<a href="#what-is-a-git-reference" aria-label="what is a git reference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>A Git reference (or ref) is a named reference to a Git commit. References are addressed via a pathname starting with <code>refs/</code>. Git branches and tags are actually just references which are stored in the <code>refs/heads</code> and <code>refs/tags</code> namespaces respectively. 
In our repo, we can see that:</p> <p>The tip of our <code>master</code> branch is commit <code>f137703</code>:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show master </span>commit f137703af59ba1b80e77505a762335805d05d212 (HEAD -> master) Author: dberenbaum <[email protected]> Date: Wed Apr 14 14:31:54 2021 -0400 Run experiments tuning random forest params</code></pre></div> <p><code>master</code> itself is a Git ref (<code>refs/heads/master</code>) pointing to that commit:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref master </span>f137703af59ba1b80e77505a762335805d05d212 refs/heads/master</code></pre></div> <h2 id="what-exactly-is-a-dvc-experiment" style="position:relative;">What exactly is a DVC experiment?<a href="#what-exactly-is-a-dvc-experiment" aria-label="what exactly is a dvc experiment permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now, going back to our experiment run, we see that DVC has generated and saved an experiment named <code>exp-26220</code>. 
We can even use that name freely within DVC commands as if it was a Git branch or tag name:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> master exp-26220 </span>Path Metric Old New Change scores.json avg_prec 0.60405 0.58589 -0.01817 scores.json roc_auc 0.9608 0.945 -0.01581 <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc diff</span> master exp-26220 </span>Modified: data/features/ data/features/test.pkl data/features/train.pkl model.pkl prc.json roc.json scores.json files summary: 0 added, 0 deleted, 0 renamed, 6 modified</code></pre></div> <p>However, Git tells us that there is no branch or tag named <code>exp-26220</code>, and we cannot use that name in Git porcelain commands:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git tag</span> <span class="token parameter variable">-l</span> </span>0-git-init 1-dvc-init 10-bigrams-experiment 11-random-forest-experiments 2-track-data 3-config-remote 4-import-data 5-source-code 6-prepare-stage 7-ml-pipeline 8-evaluation 9-bigrams-model baseline-experiment bigrams-experiment random-forest-experiments <span class="token line"><span class="token input">$ </span><span class="token command">git</span> branch <span class="token parameter variable">-l</span> </span>* master <span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> exp-26220 </span>error: pathspec 'exp-26220' did not match any file(s) known to git</code></pre></div> <p><em>Note: The Git CLI is divided into <a href="https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain" target="_blank" rel="nofollow noopener noreferrer">two sets of commands</a>: the 
commonly used user-friendly “porcelain” commands (like <code>git checkout</code>) and the lower-level “plumbing” commands.</em></p> <p>This naturally raises the question, “What is <code>exp-26220</code>?”</p> <p>The answer is simple: it’s a custom DVC Git ref pointing to a Git commit:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref exp-26220 </span>c42f48168830148b946f6a75d1bdbb25cda46f35 refs/exps/f1/37703af59ba1b80e77505a762335805d05d212/exp-26220</code></pre></div> <p><em>Note that <a href="https://dvc.org/doc/command-reference/exp/show#--sha"><code>dvc exp show --sha</code></a> can be used to view Git commit SHAs for experiments. Using DVC experiments should never require you to use any of the low-level Git plumbing commands like <code>git show-ref</code>.</em></p> <p>If we examine the experiment commit itself, we can see that it is just a regular commit object that contains our hyperparameter change and the results of the run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> show c42f481 </span>commit c42f48168830148b946f6a75d1bdbb25cda46f35 (refs/exps/f1/37703af59ba1b80e77505a762335805d05d212/exp-26220) Author: Peter Rowlands <[email protected]> Date: Mon Apr 19 04:24:04 2021 +0000 dvc: commit experiment 262206295221319fe5e8ca8a9854d6eb93ec0931fb377488910304cf5ed55f84 diff --git a/dvc.lock b/dvc.lock index 0e92326..d81fe2b 100644 --- a/dvc.lock +++ b/dvc.lock @@ -30,19 +30,19 @@ stages: size: 2455 params: params.yaml: - featurize.max_features: 3000 + featurize.max_features: 2000 featurize.ngrams: 2 ... 
diff --git a/scores.json b/scores.json index 27f6dab..8270914 100644 --- a/scores.json +++ b/scores.json @@ -1,4 +1,4 @@ { - "avg_prec": 0.6040544652105823, - "roc_auc": 0.9608017142900953 + "avg_prec": 0.5858888885424922, + "roc_auc": 0.944996664954421 } ...</code></pre></div> <h2 id="dvc-and-custom-git-refs" style="position:relative;">DVC and custom Git refs<a href="#dvc-and-custom-git-refs" aria-label="dvc and custom git refs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In DVC 2.0, we now use the custom <code>refs/exps</code> namespace for storing DVC experiments in Git. Under the hood, using Git refs allows us to keep using all of the versioning capabilities provided by Git, without polluting your repository with actual Git branches and tags. Since the user-friendly Git porcelain commands (like <code>git checkout</code> and <code>git diff</code>) only resolve branches and tags (and will ignore custom references), DVC experiments are essentially hidden from your Git repository (and only visible to DVC commands).</p> <p>Even though the experiment refs themselves are “invisible” to Git porcelain commands, Git commit SHAs for experiments can be used in any Git command. 
This allows you to leverage the power of tools like <code>git diff</code> to compare things like code changes between a DVC experiment and any other Git commit (meaning you can even compare experiment commit SHAs to Git branches or tags).</p> <p>Likewise, for tools which provide a GUI on top of Git, experiments will be hidden from your repository in typical use cases:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/463a54c559f3d4d4780e6a20d0acad93/39600/gitk-branches-tags.png" alt="gitk --branches --tags example" title="gitk --branches --tags" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em><code>gitk --branches --tags</code></em></p> <p>Tools that can display all Git refs (including custom namespaces) can also be used to view experiments as if they were Git branches:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0fae615939a6b1d30cce0339e5301a7f/39600/gitk-all.png" alt="gitk --all example screenshot" title="gitk --all" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em><code>gitk --all</code></em></p> <p>Experiments are also completely local (since custom refs are not transferred to or from Git remotes on <code>git push</code> and <code>git pull</code>), meaning that even if you run thousands of experiments locally, you do not need to worry about accidentally polluting your team’s upstream GitHub or GitLab repository with those experiments. 
However, individual DVC experiments can be explicitly shared via remote Git repositories using the <a href="https://dvc.org/doc/command-reference/exp/push"><code>dvc exp push</code></a> and <a href="https://dvc.org/doc/command-reference/exp/pull"><code>dvc exp pull</code></a> commands. Regular Git branches can also be created from experiments via <a href="https://dvc.org/doc/command-reference/exp/branch"><code>dvc exp branch</code></a>.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Prior to version 2.0, DVC already provided a method for versioning (and reproducing) ML pipelines with Git. By extending DVC's existing capabilities with the functionality offered by custom Git references, we've created a new framework for users to easily generate and track their experiments. 
And when used in conjunction with the other new features provided in 2.0 (like <a href="https://dvc.org/doc/command-reference/exp/run#checkpoints" target="_blank" rel="nofollow noopener noreferrer">checkpoints versioning</a> and <a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating" target="_blank" rel="nofollow noopener noreferrer">pipeline parametrization</a>), DVC can now fulfill certain use cases which were unfeasible with typical pre-2.0 DVC + Git workflows, including hyperparameter tuning and deep learning scenarios.</p> <p>We hope that whether you are new to DVC or a long time user, you will try out the new capabilities provided in our 2.0 release. And as always, if you have any questions, comments or suggestions, please feel free to connect with the DVC community on <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Discourse</a>, <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord</a> and <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p>https://dvc.org/blog/april-21-dvc-heartbeathttps://dvc.org/blog/april-21-dvc-heartbeatFri, 16 Apr 2021 00:00:00 GMT<h2 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We're starting with the community this month because it has been overflowing with great content from our users. 
It's like we're on a sugar high!</p> <p><img src="https://media.giphy.com/media/oiGCnybFPh6Q8/giphy.gif" alt="Sugar High"></p> <h3 id="goku-mohandas-new-lessons" style="position:relative;">Goku Mohandas' New Lessons!<a href="#goku-mohandas-new-lessons" aria-label="goku mohandas new lessons permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>First up, <a href="https://twitter.com/GokuMohandas" target="_blank" rel="nofollow noopener noreferrer">Goku Mohandas</a> of <a href="https://madewithml.com/" target="_blank" rel="nofollow noopener noreferrer">Made With ML</a> has added this <a href="https://madewithml.com/courses/mlops/versioning/" target="_blank" rel="nofollow noopener noreferrer">Versioning Lesson</a> to the popular <strong>MLOps Course</strong> using DVC.<br> It's RT'ing around the MLOps Twitter space like hotcakes! 
🥞</p> <p> </p><section class="elp-content-holder"> <a href="https://madewithml.com/courses/mlops/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">MLOps - Versioning Code, Data and Models</h4> <div class="elp-description">Using DVC to version data and models for reproducibility in a local storage use case</div> <div class="elp-link">https://madewithml.com/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-04-16/made-with-ml-logo-aecae356305b60ef1f8a39aa3a167d05.png" alt="MLOps - Versioning Code, Data and Models"> </div> </a> </section> <p></p> <h3 id="ryzal-kamis-tutorial" style="position:relative;">Ryzal Kamis Tutorial<a href="#ryzal-kamis-tutorial" aria-label="ryzal kamis tutorial permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/ryzalkamis/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ryzal Kamis</strong></a> of <a href="https://twitter.com/AISingapore" target="_blank" rel="nofollow noopener noreferrer">AI Singapore</a> has created an <a href="https://makerspace.aisingapore.org/2021/04/data-versioning-for-cd4ml-part-2/" target="_blank" rel="nofollow noopener noreferrer"><strong>in depth tutorial</strong></a> on data versioning using DVC. This is a follow up article to his <a href="https://dvc.org/blog/september-20-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">tutorial that was featured in the September Heartbeat.</a> Thanks Ryzal for this detailed work! 
🙏🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://makerspace.aisingapore.org/2021/04/data-versioning-for-cd4ml-part-2/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data Versioning for CD4ML - Part 2</h4> <div class="elp-description">Complete tutorial for beginning continuous integration, automated testing and versioning, experiment tracking, reproducing the model training pipeline and creating a Flask app for predictive use of the model </div> <div class="elp-link">https://makerspace.aisingapore.org/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-04-16/ai-singapore-logo-edbd4b64f8041fff792efadac70c5f57.jpeg" alt="Data Versioning for CD4ML - Part 2"> </div> </a> </section> <p></p> <h2 id="dvc-used-to-help-in-research-published-in-the-international-journal-of-molecular-sciences-" style="position:relative;">DVC used to help in Research published in the International Journal of Molecular Sciences 🧑🏻‍🔬<a href="#dvc-used-to-help-in-research-published-in-the-international-journal-of-molecular-sciences-" aria-label="dvc used to help in research published in the international journal of molecular sciences permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://www.linkedin.com/in/antonkulaga/" target="_blank" rel="nofollow noopener noreferrer">Anton Kulaga</a> and his team used DVC pipeline tracking in their research that selects genes connected with maximum lifespan in mammals. 
You can check out the <a href="https://www.mdpi.com/1422-0067/22/3/1073" target="_blank" rel="nofollow noopener noreferrer">paper here</a> as well as their <a href="https://docs.google.com/document/d/1kI1f62z0Opt8KD4Mf1yrYKftYLOZel3EjbfjDJiQQzg/edit" target="_blank" rel="nofollow noopener noreferrer">pipeline use case here</a> and their <a href="https://github.com/antonkulaga/yspecies" target="_blank" rel="nofollow noopener noreferrer">GitHub repository.</a></p> <p>See the diagram of the research below.👇🏼</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/be14fd9336d2db0a3e6ae40ba77b965f/39600/longevity-study.png" alt="longevity study" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>longevity research diagram</em></p> <h2 id="dagshub-️-dvc-colab-notebook" style="position:relative;">DAGsHub ❤️ DVC Colab Notebook<a href="#dagshub-%EF%B8%8F-dvc-colab-notebook" aria-label="dagshub ️ dvc colab notebook permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The DevRel team at <a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> made <a href="https://colab.research.google.com/drive/1JJIwAH0TBSY49um5s2FD0GEA6bw3SKrd#scrollTo=cjbAYZDfB3JB" target="_blank" rel="nofollow noopener noreferrer">this cool notebook</a> that trains a model to classify email as either 'Ham' or 'Spam.' 
The notebook shows how to integrate DAGsHub remote storage with DVC to track code and data files.</p> <p><img src="https://media.giphy.com/media/7pLv68ItwBaHS/giphy.gif" alt="Robin Williams Thats The Good Stuff GIF"></p> <h2 id="en-español" style="position:relative;">En Español<a href="#en-espa%C3%B1ol" aria-label="en español permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Yurely Camacho of <a href="http://opensciencelabs.org/" target="_blank" rel="nofollow noopener noreferrer">Open Science Labs</a> created this blog post on DVC and the advantages of using it for our Spanish speaking friends! 
¡Olé!💃🏻</p> <p> </p><section class="elp-content-holder"> <a href="http://opensciencelabs.org/2021/03/22/que-es-el-data-version-control-y-por-que-es-necesario-que-tu-equipo-sepa-como-utilizarlo/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Qué es el Data Version Control y por qué es necesario que tu equipo sepa cómo utilizarlo</h4> <div class="elp-description">Advantages to using DVC for data version control and team collaboration</div> <div class="elp-link">http://opensciencelabs.org/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-04-16/open-science-labs-logo-9d053dacd1ee0a6a63a146718d12b20d.png" alt="Qué es el Data Version Control y por qué es necesario que tu equipo sepa cómo utilizarlo"> </div> </a> </section> <p></p> <h2 id="dvc-news" style="position:relative;">DVC News<a href="#dvc-news" aria-label="dvc news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Pick a card, any card… You have not 1, but 3 interviews and talks to choose from this Heartbeat:</p> <ul> <li><a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov's</strong></a> <a href="https://opencv.org/opencv-ai-for-entrepreneurs-unveils-new-podcast-episode/" target="_blank" rel="nofollow noopener noreferrer">interview</a> with <a href="https://www.linkedin.com/in/anna-petrovicheva-44b24673/" target="_blank" rel="nofollow noopener noreferrer">Anna Petrovicheva</a> on <a href="https://twitter.com/opencvlibrary" 
target="_blank" rel="nofollow noopener noreferrer">Open CV</a></li> <li>Dmitry's <a href="https://www.youtube.com/watch?v=g3i-9Gk8BiA" target="_blank" rel="nofollow noopener noreferrer">interview</a> with <a href="https://twitter.com/dswharshit" target="_blank" rel="nofollow noopener noreferrer">Harshit Tyagi</a> of <a href="https://www.youtube.com/channel/UCH-xwLTKQaABNs2QmGxK2bQ" target="_blank" rel="nofollow noopener noreferrer">Data Science with Harshit</a>, and</li> <li>Dmitry's <a href="https://www.youtube.com/watch?v=J8mCr3wVgdA" target="_blank" rel="nofollow noopener noreferrer">talk</a> at the <a href="https://twitter.com/TMLS_TO" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Society</a></li> </ul> <p>Spoiler alert ⚠️: You can't choose wrong!</p> <p><img src="https://media.giphy.com/media/GXrcAztzRX9kI/giphy.gif" alt="Cards GIF"></p> <h2 id="and-we-keep-on-growing-our-worldwide-team-" style="position:relative;">And we keep on growing our worldwide team! 🌏<a href="#and-we-keep-on-growing-our-worldwide-team-" aria-label="and we keep on growing our worldwide team permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We are getting to the point where our new hires could take up our whole Heartbeat! 😅🚀💗</p> <p><a href="https://www.linkedin.com/in/julianna-galvan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Julie Galvan</strong></a> joins our team from Houston, Texas as an engineer. She is focused on web development. 
In her free time Julie loves reading, especially fantasy fiction (Harry Potter #6 was fav) and paper crafting. Welcome Julie!🎉</p> <p><a href="https://www.linkedin.com/in/matt-seddon/" target="_blank" rel="nofollow noopener noreferrer"><strong>Matt Seddon</strong></a> joins us from Down Under as a DVC front-end engineer! 🦘 He lives in Kiama, a small town on the East Coast of Australia. Originally from Scotland, when he's not programming he likes to spend time with his family away from screens (😅🙌🏼) and he volunteers for the state emergency service. 🤲🏼</p> <p><a href="https://www.linkedin.com/in/gaoyanxiang/" target="_blank" rel="nofollow noopener noreferrer"><strong>Yanxiang Gao</strong></a> (who graciously allows us to call him Gao) joins us from Hangzhou, China as a new DVC engineer. Gao has a Master's in Physics and has previously worked as a Machine Learning engineer in Chinese tech companies using DVC. He has been a long-time contributor to DVC and we are so glad to have him on the team now!🎉</p> <p><a href="https://www.linkedin.com/in/danielkharitonov/" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Kharitonov</strong></a> joins us from Stanford, California as a Technical Product Manager Intern. Daniel graduated from Stanford with a Master's in CS / AI and a PhD in MS&E. His previous industry roles involved working on core routing products at juniper.net, medical image augmentation with GANs, and synthetic data generation for autonomous vehicles. Welcome to the team Daniel! 🙌🏼</p> <p>Last but not least, joining just this week, <a href="https://www.linkedin.com/in/milecia/" target="_blank" rel="nofollow noopener noreferrer"><strong>Milecia McGregor</strong></a> joins us as a Developer Advocate from Tulsa, Oklahoma. Milecia has a background in mechanical and aerospace engineering, some machine learning on autonomous vehicles, and basically everything that the web touches. She also practices kung fu in her free time.🥋🙇🏻‍♀️ We think that's "Oklahoma, OK!" 
👌🏼</p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Even with all our new hires, we're still building!</p> <p><a href="https://weworkremotely.com/company/iterative" target="_blank" rel="nofollow noopener noreferrer"><strong>Check out our three open roles</strong></a> for:</p> <ul> <li><a href="https://weworkremotely.com/remote-jobs/iterative-senior-frontend-engineer" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Frontend Engineer</strong></a></li> <li><a href="https://weworkremotely.com/remote-jobs/iterative-senior-software-engineer-open-source-dev-tools-3" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Software Engineer - Open Source, Dev Tools</strong></a> and</li> <li><a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong>.</a></li> </ul> <p>Does this sound like you or someone you know? 
Be in touch!</p> <h2 id="next-meetup" style="position:relative;">Next Meetup<a href="#next-meetup" aria-label="next meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Don't miss our <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/277245660" target="_blank" rel="nofollow noopener noreferrer">Meetup</a> April 28th at 3:00pm UTC, where we will be demo-ing Pipelines and CML! Bring your questions! We're here to help!</p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">DVC is an amazing tool. Great milestone. <br><br>It already removes a lot of headaches in my <a href="https://twitter.com/hashtag/MachineLearning?src=hash&ref_src=twsrc%5Etfw">#MachineLearning</a> work. 
<br><br>But with new features, I will be even more productive :) <a href="https://t.co/pMyVXS292j">https://t.co/pMyVXS292j</a></p>— Vladimir Iglovikov (@viglovikov) <a href="https://twitter.com/viglovikov/status/1367193818152411137">March 3, 2021</a></blockquote> <p>We love removing your headaches! 🙌🏼 You're all caught up! See you at the next Community Gems 💎!</p> <hr> <p><em>Do you have any use case questions or need support? Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/march-21-community-gemshttps://dvc.org/blog/march-21-community-gemsWed, 31 Mar 2021 00:00:00 GMT<h3 id="q-will-dvc-work-with-my-remote-cloud-storage-of-choice" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/821493606770409493" target="_blank" rel="nofollow noopener noreferrer">Q: Will DVC work with <my remote cloud storage of choice?></a><a href="#q-will-dvc-work-with-my-remote-cloud-storage-of-choice" aria-label="q will dvc work with my remote cloud storage of choice permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We recently had questions about this, specifically regarding Huawei Cloud and Backblaze B2 Storage. The answer is any cloud storage that has an S3 interface will work with DVC and both of the aforementioned do! 
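</p> <p>As a minimal sketch of what this can look like, an S3-compatible provider is typically configured as a regular S3 remote pointed at the provider's own endpoint. The remote name <code>myremote</code>, the bucket path, and the endpoint URL below are placeholders chosen for illustration, not values from the original question; check your provider's docs for the real endpoint and credentials setup:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Placeholder remote name and bucket path: substitute your own
$ dvc remote add -d myremote s3://mybucket/dvcstore
# Placeholder endpoint: use the S3 endpoint URL your provider documents
$ dvc remote modify myremote endpointurl https://s3.example-provider.com</code></pre></div> <p>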
In addition, DVC works with Azure, Google Drive, GS, OSS, and SSH. <a href="https://dvc.org/doc/command-reference/remote" target="_blank" rel="nofollow noopener noreferrer">Learn more about S3 compatibility integrations and all available remote storage capabilities here.</a></p> <p>Thanks to @luke and @Samuel H from Discord for asking these questions that led to this Gem! 💎</p> <h3 id="q-i-had-understood-previously-that-dvc-was-not-suitable-for-hyperparameter-tuning-has-that-changed" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/820722752709328967" target="_blank" rel="nofollow noopener noreferrer">Q: I had understood previously that DVC was not suitable for hyperparameter tuning. Has that changed?</a><a href="#q-i-had-understood-previously-that-dvc-was-not-suitable-for-hyperparameter-tuning-has-that-changed" aria-label="q i had understood previously that dvc was not suitable for hyperparameter tuning has that changed permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes indeed! With DVC 2.0, the capabilities have evolved quite a bit! We have introduced experiments and metrics, which enable you to track and compare the different runs of your models with various hyperparameters. 
You can check out the documents <a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">here</a> and <a href="https://dvc.org/doc/start/metrics-parameters-plots" target="_blank" rel="nofollow noopener noreferrer">here</a> to see all the details.</p> <p>Thanks to @saif3r for helping us highlight the new features in DVC!</p> <h3 id="q-is-it-possible-to-set-up-a-dvc-repo-with-pipelines-which-have-all-the-data-cache-input-output-on-another-local-location-outside-the-repo" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/819509440217874473" target="_blank" rel="nofollow noopener noreferrer">Q: Is it possible to set up a DVC repo with pipelines which have all the data (cache, input, output) on another (local) location outside the repo?</a><a href="#q-is-it-possible-to-set-up-a-dvc-repo-with-pipelines-which-have-all-the-data-cache-input-output-on-another-local-location-outside-the-repo" aria-label="q is it possible to set up a dvc repo with pipelines which have all the data cache input output on another local location outside the repo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Thanks for the question @EEisbrenner!</p> <p>One solution to this would be to keep your DVC cache on your mount, and use the <code>symlink</code> cache type so all of your data would remain on that mount, but for DVC's purposes it would only deal with files that are "inside" your repo (via symlinks). 
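</p> <p>A rough sketch of that setup, assuming the mount is available at the hypothetical path <code>/path/to/mount</code> and that you run the commands from inside the repo:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"># Hypothetical cache location on the mounted drive
$ dvc cache dir /path/to/mount/dvc-cache
# Link files into the workspace as symlinks instead of copying them
$ dvc config cache.type symlink
# Re-create files already in the workspace using the new link type
$ dvc checkout --relink</code></pre></div> <p>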
Note that your data on that mount would be stored in DVC's content-addressable cache format, and not in <code>path/to/mount/foo.nc</code>. Check out the docs on <a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">how to keep DVC cache on your mount here.</a></p> <p>To actually work with <code>foo.nc</code>, you'd end up with a symlink <code>foo.nc</code> inside your git/DVC repo that points to some object in your DVC cache.<br> <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">See these docs</a> for info on how the cache link types work. For doing the initial <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> step for your data without needing to copy it into the DVC/repo first, <a href="https://dvc.org/doc/command-reference/add#example-transfer-to-the-cache" target="_blank" rel="nofollow noopener noreferrer">check out these docs</a>.</p> <h3 id="q-my-peers-and-i-share-a-repo-where-we-have-a-folder-that-is-versioned-with-dvc-im-getting-an-error-message-when-trying-to-pull-data-from-the-cloud-what-could-be-causing-it" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/799617584336338954" target="_blank" rel="nofollow noopener noreferrer">Q: My peers and I share a repo where we have a folder that is versioned with DVC. I'm getting an error message when trying to pull data from the cloud. 
What could be causing it?</a><a href="#q-my-peers-and-i-share-a-repo-where-we-have-a-folder-that-is-versioned-with-dvc-im-getting-an-error-message-when-trying-to-pull-data-from-the-cloud-what-could-be-causing-it" aria-label="q my peers and i share a repo where we have a folder that is versioned with dvc im getting an error message when trying to pull data from the cloud what could be causing it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>I see you are having the following error:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span>
</span>Everything is up to date.
ERROR: failed to pull data from the cloud - 'data\rhinoceros.dvc' format error: extra keys not allowed @ data['outs'][0]['size']
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc doctor</span>
</span>DVC version: 1.9.1 (exe)
---------------------------------
Platform: Python 3.7.9 on Windows-10-10.0.19041-SP0
Supports: All remotes
Cache types: hardlink
Cache directory: NTFS on C:\
Workspace directory: NTFS on C:\
Repo: dvc, git</code></pre></div> <p>The <code>.dvc</code> file contains a <code>size</code> field that your DVC 1.9.1 does not recognize, so your colleague is likely running a newer version of DVC. Upgrade so that everyone is on the same version and you will be good to go!</p> <p>Thanks @ojon for this important gem! 
💎</p> <h3 id="q-how-do-i-create-multiple-pipeline-dvcyaml-files-for-different-experiments" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/824846339288334356" target="_blank" rel="nofollow noopener noreferrer">Q: How do I create multiple pipeline (<code>dvc.yaml</code>) files for different experiments?</a><a href="#q-how-do-i-create-multiple-pipeline-dvcyaml-files-for-different-experiments" aria-label="q how do i create multiple pipeline dvcyaml files for different experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You could create separate directories for each experiment and keep your pipelines organized with separate <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files. You can find more information on <a href="https://dvc.org/doc/user-guide/experiment-management#organization-patterns" target="_blank" rel="nofollow noopener noreferrer">organization patterns for experiments here.</a> Currently we are working on a way to compare metrics between different paths if using this method of keeping experiments in different directories. 
<a href="https://github.com/iterative/dvc/issues/5074" target="_blank" rel="nofollow noopener noreferrer">You can follow that issue here!</a></p> <p>Thanks @tijoseymathew for your question in Discord!</p> <h3 id="q-is-there-a-way-to-run-git-checkout-and-dvc-checkout-in-one-command" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/818488624303046677" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to run "git checkout" and "dvc checkout" in one command?</a><a href="#q-is-there-a-way-to-run-git-checkout-and-dvc-checkout-in-one-command" aria-label="q is there a way to run git checkout and dvc checkout in one command permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yep! There's a way! We offer a Git hook for <code>post-checkout</code>, which runs DVC checkout automatically right after <code>git checkout</code>. 
You can use <a href="https://dvc.org/doc/command-reference/install"><code>dvc install</code></a> to install that hook.<br> <a href="https://dvc.org/doc/command-reference/install" target="_blank" rel="nofollow noopener noreferrer">Check out these docs</a> for all the info on installing Git hooks <a href="https://dvc.org/doc/command-reference/install#example-checkout-both-git-and-dvc" target="_blank" rel="nofollow noopener noreferrer">and here</a> for a specific example!</p> <p>Many thanks to @Thyrix for this question!</p> <h3 id="q-how-do-i-set-a-remote-in-google-drive-and-share-with-someone-else" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/819432969260761131" target="_blank" rel="nofollow noopener noreferrer">Q: How do I set a remote in Google Drive and share with someone else?</a><a href="#q-how-do-i-set-a-remote-in-google-drive-and-share-with-someone-else" aria-label="q how do i set a remote in google drive and share with someone else permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://dvc.org/doc/user-guide/setup-google-drive-remote" target="_blank" rel="nofollow noopener noreferrer">These docs</a> will show you how to get a Google Drive remote set up! Be sure to set up the remote folder's permissions so the people you share with can access it! For more information on sharing permissions in Google Drive <a href="https://support.google.com/drive/answer/7166529?co=GENIE.Platform%3DDesktop&hl=en" target="_blank" rel="nofollow noopener noreferrer">see these docs.</a></p> <p>Thanks @Carlos Lopez H for this important gem! 
💎</p> <p><img src="https://media.giphy.com/media/l0IycQmt79g9XzOWQ/giphy.gif" alt="Shut It Down GIF by Matt Cutshall"></p> <p>At our April Office Hours Meetup we will be demo-ing pipelines as well as CML. <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/277245660/?isFirstPublish=true" target="_blank" rel="nofollow noopener noreferrer">RSVP for the Meetup here</a> to stay up to date with specifics as we get closer to the event!</p> <p><a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Join us in Discord</a> to get all your DVC and CML questions answered!</p>https://dvc.org/blog/March-21-dvc-heartbeathttps://dvc.org/blog/March-21-dvc-heartbeatMon, 15 Mar 2021 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Welcome to March! It's been a great month already! 
Here's all that will keep you in the know.</p> <p><img src="https://media.giphy.com/media/J2gg8fO7RarRgQRC4d/giphy.gif" alt="UnderRock"></p> <h2 id="icymi---dvc-20-is-here" style="position:relative;">ICYMI - DVC 2.0 is here!<a href="#icymi---dvc-20-is-here" aria-label="icymi dvc 20 is here permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you somehow missed our <a href="https://dvc.org/blog/dvc-2-0-release" target="_blank" rel="nofollow noopener noreferrer">March 3rd announcement</a>, DVC 2.0 is here with loads of features to make your life easier.</p> <p>🧪 Lightweight ML experiments</p> <p>📍 ML model checkpoints versioning</p> <p>📈 Dvc-live - new open-source library for metrics logging</p> <p>🔗 ML pipeline templating and iterative foreach-stages</p> <p>🤖 CML - new way to get GPU/CPU in clouds and GitHub Actions</p> <p>This video from the team gives you an overview of all the new features.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/h-ioXYurEJo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 
id="and-we-keep-on-growing-our-worldwide-team-" style="position:relative;">And we keep on growing our worldwide team! 🌏<a href="#and-we-keep-on-growing-our-worldwide-team-" aria-label="and we keep on growing our worldwide team permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have three new team members this month!</p> <p><a href="https://www.linkedin.com/in/duijf/" target="_blank" rel="nofollow noopener noreferrer"><strong>Laurens Duijvesteijn</strong></a> joins the team from Utrecht, The Netherlands as a backend infrastructure engineer. Previously, he led a DevOps team at Channable, where he learned that he really enjoys working on developer tools and empowering people to do great work. When not solving dev challenges, he enjoys bouldering/climbing, snowboarding and hiking! Welcome Laurens!</p> <p><a href="https://github.com/0x2b3bfa0" target="_blank" rel="nofollow noopener noreferrer"><strong>Helio Machado</strong></a> joins our team from Spain as a CML engineer! Helio comes from a heutagogic background, mainly focused on the Free and Open Source culture and technologies from a systems perspective. You will find his clever cryptograph handle helping you out in Discord with your CML questions. Fun fact: Our two CML engineers, Helio and David Ortega, live just 300 km apart in Spain! CML has some Spanish flair! 💃🏻🇪🇸</p> <p><a href="https://www.linkedin.com/in/mikhail-rozhkov-33549118/" target="_blank" rel="nofollow noopener noreferrer"><strong>Mikhail Rozhkov</strong></a> joins us from Moscow, Russia as a Solution Engineer. 
Mikhail has been working with DVC for 2+ years in the banking industry. He is also the creator of the <a href="https://mlrepa.com" target="_blank" rel="nofollow noopener noreferrer"><strong>Machine Learning REPA</strong></a> community and of our <a href="https://www.udemy.com/course/machine-learning-experiments-and-engineering-with-dvc/" target="_blank" rel="nofollow noopener noreferrer"><strong>first course on Udemy</strong></a>. We are so excited to have him officially join our team full-time!</p> <p><img src="https://media.giphy.com/media/3ohhwznAY9PN08m0H6/giphy.gif" alt="Join Us"></p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Come join our team! Open positions this month:</p> <p><a href="https://docs.google.com/document/d/1aT5HZYt4kAUxXqD4JNTe3jPDlVUwSmnEWDPR2QoKdvo/edit" target="_blank" rel="nofollow noopener noreferrer">TypeScript Front-End Engineer</a> to build SaaS and a VS Code UI for our popular machine learning tools: DVC and CML. The ML tools ecosystem is where the JS space was 10 years ago. Come join us on this exciting project!</p> <p>Our search continues for a <a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer">Developer Advocate</a> to support and inspire developers by creating new content like blogs, tutorials, and videos - plus lead outreach through meetups and conferences.</p> <p>Does this sound like you or someone you know? 
Be in touch!</p> <h2 id="dmitry-featured-on-tfir-insights" style="position:relative;">Dmitry featured on TFIR Insights<a href="#dmitry-featured-on-tfir-insights" aria-label="dmitry featured on tfir insights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/tfir_io" target="_blank" rel="nofollow noopener noreferrer"><strong>Swapnil Bhartiya</strong></a> of <a href="https://www.tfir.io/" target="_blank" rel="nofollow noopener noreferrer">TFIR Insights</a> interviewed our very own CEO, <a href="https://twitter.com/fullstackml" target="_blank" rel="nofollow noopener noreferrer"><strong>Dmitry Petrov</strong></a>, on his show discussing:</p> <ul> <li>Iterative.ai</li> <li>Why Open Source is a better approach for AI/ML</li> <li>DVC and CML</li> <li>Who should care about these tools</li> <li>How DVC and CML stack up against proprietary AI Platforms such as AWS SageMaker and Microsoft Azure ML Engineer</li> </ul> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/lv2cpm9Pduk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of 
Service</a></span></div></div> <h2 id="elle-at-datatalksclub-conference" style="position:relative;">Elle at DataTalks.Club Conference<a href="#elle-at-datatalksclub-conference" aria-label="elle at datatalksclub conference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://twitter.com/drelleobrien" target="_blank" rel="nofollow noopener noreferrer"><strong>Elle O'Brien</strong></a> presents her talk "Automating ML with Continuous Integration" at the <a href="http://datatalks.club/" target="_blank" rel="nofollow noopener noreferrer">DataTalks.Club</a> Conference with <a href="https://twitter.com/Al_Grigor" target="_blank" rel="nofollow noopener noreferrer"><strong>Alexey Grigorev</strong></a> and <a href="https://www.linkedin.com/in/dpbrinkm/" target="_blank" rel="nofollow noopener noreferrer"><strong>Demetrios Brinkmann</strong></a> of <a href="https://open.spotify.com/show/7wZygk3mUUqBaRbBGB1lgh" target="_blank" rel="nofollow noopener noreferrer">MLOps Community</a>. You can catch her talk starting at 3:03 below. 
👇🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://www.youtube.com/watch?v=og1DG1KZ71c&t=11382s" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Automating ML with Continuous Integration</h4> <div class="elp-description">Elle O'Brien, PhD presents at DataTalks.Club Conference</div> <div class="elp-link">DataTalks.Club</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-03-15/confused-animals-3a01f72852765a7c4ced04e0819e8ba2.png" alt="Automating ML with Continuous Integration"> </div> </a> </section> <p></p> <h2 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="using-dvc-in-lab-data-management" style="position:relative;">Using DVC in Lab Data Management<a href="#using-dvc-in-lab-data-management" aria-label="using dvc in lab data management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This great tutorial from <a href="https://mti-lab.github.io/blog/" target="_blank" rel="nofollow 
noopener noreferrer">Matsui-lab Blog</a> provides a solution using DVC for the data management problem labs face.</p> <p> </p><section class="elp-content-holder"> <a href="https://mti-lab.github.io/blog/yusuke%20matsui/education/labops/2021/03/03/dvc.html" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Versioning a Shared Dataset Using DVC and S3</h4> <div class="elp-description">DVC solution in a lab environment</div> <div class="elp-link">mti-lab.github.io</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-03-15/matsui-lab-blog-ed064db061f5e0f5ca1ce475fad16fe3.png" alt="Versioning a Shared Dataset Using DVC and S3"> </div> </a> </section> <p></p> <h3 id="healthcare-use-case-video-tutorial" style="position:relative;">Healthcare Use Case Video Tutorial<a href="#healthcare-use-case-video-tutorial" aria-label="healthcare use case video tutorial permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.linkedin.com/in/danial-senejohnny/" target="_blank" rel="nofollow noopener noreferrer"><strong>Danial Senejohnny</strong></a> created this video outlining the use of DVC for healthcare institutes, where the data must be kept private and an on-premises data store is preferred. 
👇🏼</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/K1iyWr4Z6go?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="scientific-journals-" style="position:relative;">Scientific Journals 🧑🏻‍🔬<a href="#scientific-journals-" aria-label="scientific journals permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We are excited to announce a scientific paper purely devoted to DVC coming out from Queen's University. This publication by <a href="https://www.linkedin.com/in/amine-barrak-0bb99160/" target="_blank" rel="nofollow noopener noreferrer"><strong>Amine Barrak</strong></a>, <a href="https://www.linkedin.com/in/elliseghan/" target="_blank" rel="nofollow noopener noreferrer"><strong>Ellis E Eghan</strong></a> and <a href="https://www.linkedin.com/in/bramadams/" target="_blank" rel="nofollow noopener noreferrer"><strong>Bram Adams</strong></a>, will be presented at the 28th IEEE International Conference on Software Analysis, Evolution, and Reengineering. You can check it out here. 
👇🏼</p> <p> </p><section class="elp-content-holder"> <a href="https://mcis.cs.queensu.ca/publications/2021/saner.pdf" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects</h4> <div class="elp-description">Empirical Study of DVC Projects</div> <div class="elp-link">mcis.cs.queensu.ca</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-03-15/EmpiricalStudyDVC-3e11b88175803e4a49d528d1f008126d.png" alt="On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects"> </div> </a> </section> <p></p> <p>This article by <strong>Samuel Idowu</strong>, <a href="https://www.linkedin.com/in/daniel-g-str%C3%BCber-359134100/" target="_blank" rel="nofollow noopener noreferrer"><strong>Daniel Struber</strong></a>, and <a href="https://www.linkedin.com/in/thorsten-berger-3a6a851ab/" target="_blank" rel="nofollow noopener noreferrer"><strong>Thorsten Berger</strong></a>, reviews a number of asset management tools for machine learning including DVC, that solve the commonly reported ML engineering challenges.</p> <p> </p><section class="elp-content-holder"> <a href="https://arxiv.org/pdf/2102.06919.pdf" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Asset Management in Machine Learning: A Survey</h4> <div class="elp-description">Steps to use DVC in your data versioning</div> <div class="elp-link">arxiv.org</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-03-15/arxiv-89cc24e1d73a143584fc0fb6a35d39a5.png" alt="Asset Management in Machine Learning: A Survey"> </div> </a> </section> <p></p> <p><img src="https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif" alt="ScienceMindBlown"></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a 
href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>From a Portuguese speaking community member in Finland…</p> <blockquote> <p>"The @DVCorg surely it is among the best tools of the ecosystem of the last 3 years. It won't be long before DVC is as common as Scikit-Learn in ML / DS projects with high maturity. 👏🏼👏🏼👏🏼"</p> </blockquote> <p>O <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">@DVCorg</a> seguramente está entre as melhores ferramentas do ecossistema dos últimos 3 anos. Não vai demorar para o DVC ser tão comum quanto o Scikit-Learn em projetos de ML/DS com alta maturidade. 👏👏👏 <a href="https://t.co/nnfecYoTQv" target="_blank" rel="nofollow noopener noreferrer">https://t.co/nnfecYoTQv</a></p> <p>— Flávio Clésio March 3, 2021</p> <p>We think so too! 🙌🏼 You're all caught up! See you at the next Community Gems 💎!</p> <hr> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/dvc-2-0-releasehttps://dvc.org/blog/dvc-2-0-releaseWed, 03 Mar 2021 00:00:00 GMT<h2 id="tldr-video" style="position:relative;">TL;DR; video<a href="#tldr-video" aria-label="tldr video permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/h-ioXYurEJo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="what-is-new-in-dvc-20" style="position:relative;">What is new in DVC 2.0?<a href="#what-is-new-in-dvc-20" aria-label="what is new in dvc 20 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 
4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have been working on DVC for almost 4 years. In previous versions, we built a solid foundation for versioning data, code and ML models that helps make your ML projects reproducible.</p> <p>With the 2.0 release, we are going deeper into machine learning and deep learning scenarios such as <strong>experiment management</strong>, <strong>ML model checkpoints</strong> and <strong>ML metrics logging</strong>. These scenarios are widely adopted by ML practitioners and instrumented with custom tools or external frameworks and SaaS services. <strong>Our vision</strong> is to make the ML experimentation experience distributed (like Git) and independent of external SaaS platforms, and to introduce proper data and model management to ML experiments.</p> <p>⚠️ DVC 2.0 is the first release with ML experiments, which are still in experimentation mode (yeah, experiments in experimentation mode 😅), so the API might change a bit in the following releases.</p> <p><strong>ML pipelines parametrization</strong> is another big improvement in DVC 2.0. This was the most requested feature during the last year. We are introducing variables in pipelines as well as foreach-stages. This is a significant improvement for users who work on multi-stage ML projects, which are very common in NLP.</p> <p>Better <strong>CPU/GPU resource allocation</strong> is another important direction for DVC. Together with DVC 2.0, we are releasing version 0.3 of CML (CI/CD for ML). It aims to hide all the complexity of clouds from data scientists and ML engineers. We developed a brand new Iterative Terraform Provider to reach this goal and simplify the end-user experience. 
In future releases, we expect DVC to use this Terraform provider to access cloud resources directly.</p> <p>Last but not least, we made the new release with <strong>minimal breaking changes to our API</strong>. That makes migration to DVC 2.0 smooth and low-risk.</p> <h2 id="install" style="position:relative;">Install<a href="#install" aria-label="install permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The new version is generally available!</p> <p>Install DVC 2.0 <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">through OS packages</a> or as a Python library:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">--upgrade</span> dvc</span></code></pre></div> <p>CML is pre-installed in the CML Docker containers (e.g. 
<code>iterativeai/cml:0-dvc2-base1</code>) and also available as an NPM package:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">npm</span> i <span class="token parameter variable">-g</span> @dvcorg/cml</span></code></pre></div> <h2 id="lightweight-ml-experiments" style="position:relative;">Lightweight ML experiments<a href="#lightweight-ml-experiments" aria-label="lightweight ml experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>DVC uses Git versioning as the basis for ML experiments. This solid foundation makes each experiment reproducible and accessible from the project's history. This Git-based approach works very well for ML projects with mature models when only a few new experiments per day are run.</p> <p>However, in more active development, when dozens or hundreds of experiments need to be run in a single day, Git creates overhead — each experiment run requires additional Git commands <code>git add/commit</code>, and comparing all experiments is difficult.</p> <p>We are introducing lightweight experiments in DVC 2.0! This is how you can auto-track ML experiments without any overhead.</p> <p>⚠️ Note, our new ML experiment features (<a href="https://dvc.org/doc/command-reference/exp"><code>dvc exp</code></a>) are experimental. 
This means that the commands might change a bit in the following minor releases.</p> <p><a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> can run an ML experiment with a new hyperparameter from <code>params.yaml</code> while <a href="https://dvc.org/doc/command-reference/exp/diff"><code>dvc exp diff</code></a> shows metrics and params difference:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">3000</span> </span> Reproduced experiment(s): exp-bb55c Experiment results have been applied to your workspace. <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp diff</span> </span>Path Metric Value Change scores.json auc 0.57462 0.0072197 Path Param Value Change params.yaml featurize.max_features 3000 1500</code></pre></div> <p>More experiments:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">4000</span> </span>Reproduced experiment(s): exp-9bf22 Experiment results have been applied to your workspace. 
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">5000</span> </span>Reproduced experiment(s): exp-63ee0 Experiment results have been applied to your workspace. <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">5000</span> <span class="token punctuation">\</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.ngrams</span><span class="token operator">=</span><span class="token number">3</span> </span>Reproduced experiment(s): exp-80655 Experiment results have been applied to your workspace.</code></pre></div> <p>In the examples above, hyperparameters were changed with the <code>--set-param</code> option, but you can make these changes by modifying the params file instead. 
In fact <em>any code can be changed</em> and <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> will capture the variations.</p> <p>See all the runs:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span> <span class="token parameter variable">--include-params</span> featurize.max_features,featurize.ngrams</span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ───────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>auc<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>featurize.max_features<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>featurize.ngrams<span class="token hide">**</span></span> </span> ───────────────────────────────────────────────────────────────────── <span class="token rows"> workspace 0.56359 5000 3 master 0.5674 1500 2 ├── exp-80655 0.56359 5000 3 ├── exp-63ee0 0.5515 5000 2 ├── exp-9bf22 0.56448 4000 2 └── exp-bb55c 0.57462 3000 2 </span> ─────────────────────────────────────────────────────────────────────</code></pre></div> <p>Under the hood, DVC uses Git to store the experiments' meta-information. A straight-forward implementation would create visible branches and auto-commit in them, but that approach would over-pollute the branch namespace very quickly. 
To avoid this issue, we introduced custom Git references <code>exps</code>, the same way as GitHub uses custom references <code>pulls</code> to track pull requests (this is an interesting technical topic that deserves a separate blog post). Below you can see how it works.</p> <p>No artificial branches, only custom references <code>exps</code> (do not worry if you don't understand this part - it is an implementation detail):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> branch </span>* master <span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref </span>5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY 5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH 5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655 f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0 0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c 9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22</code></pre></div> <p>The best experiment can be promoted to the workspace and committed to Git.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp apply</span> exp-bb55c </span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span> </span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">'optimize max feature 
size'</span></span></code></pre></div> <p>Alternatively, an experiment can be promoted to a branch (<code>big_fr_size</code> branch in this case):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp branch</span> exp-80655 big_fr_size </span>Git branch 'big_fr_size' has been created from experiment 'exp-c695f'. To switch to the new branch run: git checkout big_fr_size</code></pre></div> <p>Remove all the experiments that were not used:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp gc</span> <span class="token parameter variable">--workspace</span> <span class="token parameter variable">--force</span></span></code></pre></div> <h2 id="ml-model-checkpoints-versioning" style="position:relative;">ML model checkpoints versioning<a href="#ml-model-checkpoints-versioning" aria-label="ml model checkpoints versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>ML model checkpoints are an essential part of deep learning. 
ML engineers prefer to save the model files (or weights) at checkpoints during training and return to them when metrics start diverging or learning is not fast enough.</p> <p>Checkpoints create a different dynamic around the ML modeling process and need special support from the toolset:</p> <ol> <li>Track and save model checkpoints (DVC outputs) periodically, not only the final result or training epoch.</li> <li>Save metrics corresponding to each of the checkpoints.</li> <li>Reuse checkpoints - warm-start training with an existing model file, corresponding code, dataset version and metrics.</li> </ol> <p>This new behavior is supported in DVC 2.0. Now, DVC can version all your checkpoints with the corresponding code and data. It brings the reproducibility of DL processes to the next level - every checkpoint is reproducible.</p> <p>This is how you define checkpoints with live metrics:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc stage add</span> <span class="token parameter variable">-n</span> train <span class="token punctuation">\</span> <span class="token parameter variable">-d</span> users.csv <span class="token parameter variable">-d</span> train.py <span class="token punctuation">\</span> <span class="token parameter variable">-p</span> dropout,epochs,lr,process <span class="token punctuation">\</span> <span class="token parameter variable">--checkpoint</span> model.h5 <span class="token punctuation">\</span> <span class="token parameter variable">--live</span> logs <span class="token punctuation">\</span> python train.py </span> Creating 'dvc.yaml' Adding stage 'train' in 'dvc.yaml'</code></pre></div> <p>Note that we use the <a href="https://dvc.org/doc/command-reference/stage/add"><code>dvc stage add</code></a> command instead of <code>dvc run</code>. 
Starting with DVC 2.0, we are moving all stage-specific functionality under the <a href="https://dvc.org/doc/command-reference/stage"><code>dvc stage</code></a> umbrella. <code>dvc run</code> still works, but will be deprecated in the next major DVC version (most likely 3.0).</p> <p>Start the training process and interrupt it after 5 epochs:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> </span>'users.csv.dvc' didn't change, skipping Running stage 'train': > python train.py ... ^CTraceback (most recent call last): ... KeyboardInterrupt</code></pre></div> <p>Navigate through the checkpoints:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token 
bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> </span> ────────────────────────────────────────────────────────────────────── <span class="token rows"> workspace 4 2.0702 0.30388 2.025 … 5 … master - - - - … 5 … │ ╓ exp-e15bc 4 2.0702 0.30388 2.025 … 5 … │ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 … │ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 … │ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 … │ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 … ├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 … </span> ──────────────────────────────────────────────────────────────────────</code></pre></div> <p>Each of the checkpoints above is a separate experiment with all its data, code, parameters and metrics. You can use the same <a href="https://dvc.org/doc/command-reference/exp/apply"><code>dvc exp apply</code></a> command to extract any of these.</p> <p>Another run continues this process. You can see the accuracy metric increasing - DVC does not remove the model/checkpoint, so the training code trains on top of it:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> </span>Existing checkpoint experiment 'exp-e15bc' will be resumed ... 
^C KeyboardInterrupt <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> </span> ────────────────────────────────────────────────────────────────────── <span class="token rows"> workspace 9 1.7845 0.58125 1.7381 … 5 … master - - - - … 5 … │ ╓ exp-e15bc 9 1.7845 0.58125 1.7381 … 5 … │ ╟ 205a8d3 9 1.7845 0.58125 1.7381 … 5 … │ ╟ dd23d96 8 1.8369 0.54173 1.7919 … 5 … │ ╟ 5bb3a1f 7 1.8929 0.49108 1.8474 … 5 … │ ╟ 6dc5610 6 1.951 0.43433 1.9046 … 5 … │ ╟ a79cf29 5 2.0088 0.36837 1.9637 … 5 … │ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 … │ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 … │ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 … │ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 … ├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 … </span> 
──────────────────────────────────────────────────────────────────────</code></pre></div> <p>After modifying the code, data, or params, the same process can be resumed. DVC recognizes the change and shows it (see experiment <code>b363267</code>):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> train.py <span class="token comment"># modify code</span> </span><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> params.yaml <span class="token comment"># modify params</span> </span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> </span>Modified checkpoint experiment based on 'exp-e15bc' will be created ... <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ────────────────────────────────────────────────────────────────────────────── <span class="token rows"> <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> <span class="token 
bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span> <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span> </span> ────────────────────────────────────────────────────────────────────────────── <span class="token rows"> workspace 13 1.5841 0.69262 1.5381 … 15 … master - - - - … 5 … │ ╓ exp-7ff06 13 1.5841 0.69262 1.5381 … 15 … │ ╟ 6c62fec 12 1.6325 0.67248 1.5857 … 15 … │ ╟ 4baca3c 11 1.6817 0.64855 1.6349 … 15 … │ ╟ b363267 (2b06de7) 10 1.7323 0.61925 1.6857 … 15 … │ ╓ 2b06de7 9 1.7845 0.58125 1.7381 … 5 … │ ╟ 205a8d3 9 1.7845 0.58125 1.7381 … 5 … │ ╟ dd23d96 8 1.8369 0.54173 1.7919 … 5 … │ ╟ 5bb3a1f 7 1.8929 0.49108 1.8474 … 5 … │ ╟ 6dc5610 6 1.951 0.43433 1.9046 … 5 … │ ╟ a79cf29 5 2.0088 0.36837 1.9637 … 5 … │ ╟ 5ea8327 4 2.0702 0.30388 2.025 … 5 … │ ╟ bc0cf02 3 2.1338 0.23988 2.0883 … 5 … │ ╟ f8cf03f 2 2.1989 0.17932 2.1542 … 5 … │ ╟ 7575a44 1 2.2694 0.12833 2.223 … 5 … ├─╨ a72c526 0 2.3416 0.0959 2.2955 … 5 … </span> ──────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>Sometimes you might need to train the model from scratch. 
The reset option removes the checkpoint file before training: <a href="https://dvc.org/doc/command-reference/exp/run#--reset"><code>dvc exp run --reset</code></a>.</p> <h2 id="metrics-logging" style="position:relative;">Metrics logging<a href="#metrics-logging" aria-label="metrics logging permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Continuously logging ML metrics is a very common practice in the ML world. Instead of simple command-line output with the metrics values, many ML engineers prefer visuals and plots. These plots can be organized in a "database" of ML experiments to keep track of a project. There are many specialized solutions for collecting metrics and tracking experiments, such as Sacred, MLflow, Weights &amp; Biases, neptune.ai, and others.</p> <p>With DVC 2.0, we are releasing a new open-source library <a href="https://github.com/iterative/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVC-Live</a> that tracks model metrics and organizes them in simple text files, so that DVC can visualize the metrics while navigating Git history. 
So, DVC can show you the difference in metrics between the current model and a model in <code>master</code> or any other branch.</p> <p>This approach is similar to other metrics tracking tools, with the difference that Git becomes a "database" of ML experiments.</p> <h3 id="generate-metrics-file" style="position:relative;">Generate metrics file<a href="#generate-metrics-file" aria-label="generate metrics file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Install the library:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> dvclive</span></code></pre></div> <p>Instrument your code:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvclive <span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>keras <span class="token keyword">import</span> DvcLiveCallback dvclive<span class="token punctuation">.</span>init<span class="token punctuation">(</span><span class="token string">"logs"</span><span class="token punctuation">)</span> <span class="token comment">#, summarize=True)</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> model<span class="token punctuation">.</span>fit<span class="token punctuation">(</span><span class="token 
punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> <span class="token comment"># Set up DVC-Live callback:</span> callbacks<span class="token operator">=</span><span class="token punctuation">[</span> DvcLiveCallback<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">]</span> <span class="token punctuation">)</span> </code></pre></div> <p>During training, you will see the metrics files being populated after each epoch:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">ls</span> logs/ </span>accuracy.tsv loss.tsv val_accuracy.tsv val_loss.tsv <span class="token line"><span class="token input">$ </span><span class="token command">head</span> logs/accuracy.tsv </span>timestamp step accuracy 1613645582716 0 0.7360000014305115 1613645585478 1 0.8349999785423279 1613645587322 2 0.8830000162124634 1613645589125 3 0.9049999713897705 1613645590891 4 0.9070000052452087 1613645592681 5 0.9279999732971191 1613645594490 6 0.9430000185966492 1613645596232 7 0.9369999766349792 1613645598034 8 0.9430000185966492</code></pre></div> <p>In addition to the continuous metrics files, you will see a summary metrics file and an HTML file with the same file prefix. 
The summary file contains the result of the latest epoch:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> logs.json <span class="token operator">|</span> python <span class="token parameter variable">-m</span> json.tool </span>{ "step": 41, "loss": 0.015958430245518684, "accuracy": 0.9950000047683716, "val_loss": 13.705962181091309, "val_accuracy": 0.5149999856948853 }</code></pre></div> <p>The HTML file contains all the visuals for continuous metrics as well as the summary metrics on a single page:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b66f0f1e2076cdf2661acb4f621e7255/39600/dvclive-html.png" alt="dvclive html" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Note that the HTML and summary metrics files are regenerated automatically after each epoch, so you can monitor model performance in real time.</p> <h3 id="git-navigation-with-the-metrics-file" style="position:relative;">Git-navigation with the metrics file<a href="#git-navigation-with-the-metrics-file" aria-label="git navigation with the metrics file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>A DVC repository is NOT required to use the live metrics functionality described above. 
It works independently from DVC.</p> <p>DVC repository becomes useful when the metrics and plots are committed in your Git repository, and you need navigation around the metrics.</p> <p>Metrics difference between workspace and the last Git commit:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git status</span> <span class="token parameter variable">-s</span> </span> M logs.json M logs/accuracy.tsv M logs/loss.tsv M logs/val_accuracy.tsv M logs/val_loss.tsv M train.py ?? model.h5 <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> <span class="token parameter variable">--target</span> logs.json </span>Path Metric Old New Change logs.json accuracy 0.995 0.99 -0.005 logs.json loss 0.01596 0.03036 0.0144 logs.json step 41 36 -5 logs.json val_accuracy 0.515 0.5175 0.0025 logs.json val_loss 13.70596 3.29033 -10.41563</code></pre></div> <p>The difference between a particular commit/branch/tag or between two commits:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> <span class="token parameter variable">--target</span> logs.json HEAD^ 47b85c </span>Path Metric Old New Change logs.json accuracy 0.995 0.998 0.003 logs.json loss 0.01596 0.01951 0.00355 logs.json step 41 82 41 logs.json val_accuracy 0.515 0.51 -0.005 logs.json val_loss 13.70596 5.83056 -7.8754</code></pre></div> <p>The same Git-navigation works with the plots:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">--target</span> logs 
</span>file:///Users/dmitry/src/exp-dc/plots.html</code></pre></div> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cdc4ec4dabed1d7de6b8606667ebfc83/39600/dvclive-diff-html.png" alt="dvclive diff html" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Another nice thing about the live metrics - they work across ML experiments and checkpoints, if properly set up in dvc stages. To set up live metrics, you need to specify the metrics directory in the <code>live</code> section of a stage:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py <span class="token key atrule">live</span><span class="token punctuation">:</span> <span class="token key atrule">logs</span><span class="token punctuation">:</span> <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span> <span class="token key atrule">summary</span><span class="token punctuation">:</span> <span class="token boolean important">true</span> <span class="token key atrule">report</span><span class="token punctuation">:</span> <span class="token boolean important">true</span> <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> data</code></pre></div> <h2 id="ml-pipelines-parameterization-and-foreach-stages" style="position:relative;">ML pipelines parameterization and foreach stages<a href="#ml-pipelines-parameterization-and-foreach-stages" 
aria-label="ml pipelines parameterization and foreach stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>After we introduced the multi-stage pipeline file <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, it was quickly adopted by our users. The DVC team got tons of positive feedback from them, as well as feature requests.</p> <h3 id="pipeline-parameters-from-vars" style="position:relative;">Pipeline parameters from <code>vars</code><a href="#pipeline-parameters-from-vars" aria-label="pipeline parameters from vars permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The most requested feature was the ability to use parameters in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. 
So, you can pass the same seed value or filename to multiple stages in the pipeline.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">vars</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">train_matrix</span><span class="token punctuation">:</span> train.pkl <span class="token punctuation">-</span> <span class="token key atrule">test_matrix</span><span class="token punctuation">:</span> test.pkl <span class="token punctuation">-</span> <span class="token key atrule">seed</span><span class="token punctuation">:</span> <span class="token number">20210215</span> <span class="token punctuation">...</span> <span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">process</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python process.py \ <span class="token punctuation">-</span><span class="token punctuation">-</span>seed $<span class="token punctuation">{</span>seed<span class="token punctuation">}</span> \ <span class="token punctuation">-</span><span class="token punctuation">-</span>train $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> \ <span class="token punctuation">-</span><span class="token punctuation">-</span>test $<span class="token punctuation">{</span>test_matrix<span class="token punctuation">}</span> <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> $<span class="token punctuation">{</span>test_matrix<span class="token punctuation">}</span> <span class="token punctuation">-</span> $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> <span class="token punctuation">...</span> <span class="token key 
atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>seed $<span class="token punctuation">{</span>seed<span class="token punctuation">}</span> <span class="token key atrule">deps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span></code></pre></div> <p>This also lets you keep all the important parameters in a single <code>vars</code> block and experiment with them. This is natural for scenarios like NLP, or when hyperparameter optimization happens not only in the model training code but also in data processing.</p> <h3 id="pipeline-parameters-from-params-files" style="position:relative;">Pipeline parameters from params files<a href="#pipeline-parameters-from-params-files" aria-label="pipeline parameters from params files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>It is quite common to define pipeline parameters in a config file or a parameters file (like <code>params.yaml</code>) instead of in the pipeline file <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> itself. 
These parameters defined in <code>params.yaml</code> can also be used in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># params.yaml</span> <span class="token key atrule">models</span><span class="token punctuation">:</span> <span class="token key atrule">us</span><span class="token punctuation">:</span> <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span> <span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-us.hdf5'</span></code></pre></div> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># dvc.yaml</span> <span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">build-us</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token punctuation">-</span> python script.py <span class="token punctuation">-</span><span class="token punctuation">-</span>out $<span class="token punctuation">{</span>models.us.filename<span class="token punctuation">}</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>thresh $<span class="token punctuation">{</span>models.us.thresh<span class="token punctuation">}</span> <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> $<span class="token punctuation">{</span>models.us.filename<span class="token punctuation">}</span></code></pre></div> <p>DVC properly tracks params dependencies for each stage starting from the previous DVC version 1.0. 
See the <a href="https://dvc.org/doc/command-reference/run#for-displaying-and-comparing-data-science-experiments" target="_blank" rel="nofollow noopener noreferrer"><code>--params</code> option</a> of <code>dvc run</code> for more details.</p> <h3 id="iterating-over-params-with-foreach-stages" style="position:relative;">Iterating over params with foreach stages<a href="#iterating-over-params-with-foreach-stages" aria-label="iterating over params with foreach stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Iterating over params was a frequently requested feature. Now users can define multiple similar stages with a templatized command.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token key atrule">build</span><span class="token punctuation">:</span> <span class="token key atrule">foreach</span><span class="token punctuation">:</span> <span class="token key atrule">gb</span><span class="token punctuation">:</span> <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">15</span> <span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-gb.hdf5'</span> <span class="token key atrule">us</span><span class="token punctuation">:</span> <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span> <span class="token 
key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-us.hdf5'</span> <span class="token key atrule">do</span><span class="token punctuation">:</span> <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token punctuation">-</span> python script.py <span class="token punctuation">-</span><span class="token punctuation">-</span>out $<span class="token punctuation">{</span>item.filename<span class="token punctuation">}</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>thresh $<span class="token punctuation">{</span>item.thresh<span class="token punctuation">}</span> <span class="token key atrule">outs</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> $<span class="token punctuation">{</span>item.filename<span class="token punctuation">}</span></code></pre></div> <h2 id="new-method-to-provision-cloud-compute-in-new-cml-release" style="position:relative;">New method to provision cloud compute in new CML release<a href="#new-method-to-provision-cloud-compute-in-new-cml-release" aria-label="new method to provision cloud compute in new cml release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We are releasing CML 0.3 together with DVC 2.0. 
We developed a brand new CML command <code>cml runner</code> that hides much of the complexity of configuring and provisioning an instance, keeping your workflows free of bash scripting clutter.</p> <p>The new approach uses our new <a href="https://github.com/iterative/terraform-provider-iterative" target="_blank" rel="nofollow noopener noreferrer">Iterative Terraform Provider</a> under the hood instead of Docker Machine, which the first version of CML used.</p> <p>This example workflow launches an EC2 instance from a GitHub Actions workflow and then trains a model. We hope you'll agree it's shorter, sweeter, and more powerful than ever!</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train in the cloud'</span> <span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span> <span class="token key atrule">jobs</span><span class="token punctuation">:</span> <span class="token key atrule">deploy-runner</span><span class="token punctuation">:</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>ubuntu<span class="token punctuation">-</span>latest<span class="token punctuation">]</span> <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> deploy <span 
class="token key atrule">shell</span><span class="token punctuation">:</span> bash <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> cml runner \ --cloud aws \ --cloud-region us-west \ --cloud-type=t2.micro \ --labels=cml-runner</span> <span class="token key atrule">train-model</span><span class="token punctuation">:</span> <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span> <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span 
class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2 <span class="token key atrule">with</span><span class="token punctuation">:</span> <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span> <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train my model'</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> pip install -r requirements.txt python train.py</span></code></pre></div> <p>You'll get a pull request that looks something like this:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c06746a683bc64bdcbde8464ca728656/39600/sample_pr.png" alt="sample pr" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>All the code to replicate this example is up on a <a href="https://github.com/iterative/cml-runner-base-case" target="_blank" rel="nofollow noopener noreferrer">brand new demo repository</a>.</p> <p>Please find more details in the <a href="https://dvc.org/blog/cml-runner-prerelease" target="_blank" rel="nofollow noopener noreferrer">CML 0.3 pre-release blog post</a> or in the <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML website</a>.</p> <h2 id="github-actions-in-new-cml-release" style="position:relative;">GitHub Actions in new CML release<a href="#github-actions-in-new-cml-release" aria-label="github actions in new cml release permalink" 
class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>One more thing: you might've noticed in our example workflow above that there's a <a href="https://github.com/iterative/setup-cml" target="_blank" rel="nofollow noopener noreferrer">new CML GitHub Action</a>! The new Action helps you set up CML, giving you one more way to mix and match the CML suite of functions with your preferred environment.</p> <p>The new Action is designed to be a straightforward, all-in-one install that gives you immediate use of functions like <code>cml publish</code> and <code>cml runner</code>. You'll add this step to your workflow:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1</code></pre></div> <p><a href="https://github.com/iterative/setup-cml" target="_blank" rel="nofollow noopener noreferrer">More details are in the docs!</a></p> <p>In the same way, you can reference DVC as a GitHub Action:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> 
<span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>dvc@v1</code></pre></div> <p><a href="https://github.com/iterative/setup-dvc" target="_blank" rel="nofollow noopener noreferrer">See the DVC GitHub Action</a></p> <h2 id="breaking-changes" style="position:relative;">Breaking changes<a href="#breaking-changes" aria-label="breaking changes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We put a lot of effort into keeping the breaking changes in this release to a minimum, to simplify migration to the new version:</p> <ol> <li>Dropped support for external outputs in Google Cloud Storage and changed the default checksum from md5 to etag.</li> <li>Dropped support for login with p12 files on service authentication for Google Drive.</li> <li>Stages without dependencies will not always run as if changed. 
Instead, use <code>--always-changed</code>.</li> <li>Environment variables inside the cmd of a stage using <code>${VAR}</code> syntax must be escaped as <code>\${VAR}</code> in 2.0 due to the use of <code>${}</code> syntax for templating.</li> </ol> <h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Thank you to all DVC users and community members for the help. Please try out the new DVC and CML releases and do not get lost in your ML experiments!</p>https://dvc.org/blog/february-21-community-gemshttps://dvc.org/blog/february-21-community-gemsFri, 26 Feb 2021 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC Questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-i-noticed-i-have-a-dvc-config-file-and-a-configlocal-file-whats-best-practice-for-committing-these-to-my-git-repository" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/666708671333400599" target="_blank" rel="nofollow noopener noreferrer">Q: I noticed I 
have a DVC <code>config</code> file and a <code>config.local</code> file. What's best practice for committing these to my Git repository?</a><a href="#q-i-noticed-i-have-a-dvc-config-file-and-a-configlocal-file-whats-best-practice-for-committing-these-to-my-git-repository" aria-label="q i noticed i have a dvc config file and a configlocal file whats best practice for committing these to my git repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC uses the <code>config</code> and <code>config.local</code> files to link your remote data repository to your project. <code>config</code> is intended to be committed to Git, while <code>config.local</code> is not: it's a file for sensitive information (e.g. personal credentials such as a username, password, or access keys for remote storage) or settings that are specific to your local environment.</p> <p>Usually, you don't have to worry about whether your <code>config.local</code> file is ignored by Git: the only way to create a <code>config.local</code> file is to pass the <code>--local</code> flag explicitly to commands like <a href="https://dvc.org/doc/command-reference/remote"><code>dvc remote</code></a> and <a href="https://dvc.org/doc/command-reference/config"><code>dvc config</code></a>, so you'll know you've made one! Your <code>config.local</code> file is also <code>.gitignored</code> by default. 
If you're concerned, take a look and make sure there are no settings in your <code>config.local</code> file that you actually want in your regular <code>config</code> file.</p> <p>To learn more about <code>config</code> and <code>config.local</code>, <a href="https://dvc.org/doc/command-reference/remote#example-add-a-default-local-remote" target="_blank" rel="nofollow noopener noreferrer">read up in our docs</a>.</p> <h3 id="q-whats-the-best-way-to-install-the-new-version-of-dvc-in-a-conda-environment-im-concerned-about-the-paramiko-dependency" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/669173874247729165" target="_blank" rel="nofollow noopener noreferrer">Q: What's the best way to install the new version of DVC in a Conda environment? I'm concerned about the <code>paramiko</code> dependency.</a><a href="#q-whats-the-best-way-to-install-the-new-version-of-dvc-in-a-conda-environment-im-concerned-about-the-paramiko-dependency" aria-label="q whats the best way to install the new version of dvc in a conda environment im concerned about the paramiko dependency permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>When you install DVC via <code>conda</code>, it comes with dependencies like <code>paramiko</code>.</p> <p>The only exception is installing DVC as a Python library with <code>pip</code>: there, you might want to specify the kind of remote storage you need, to make sure all dependencies are present (like <code>boto</code> for S3). 
You can run <code>pip install "dvc[&lt;option&gt;]"</code>, with supported options like <code>[s3]</code>, <code>[azure]</code>, <code>[gdrive]</code>, <code>[gs]</code>, <code>[oss]</code>, <code>[ssh]</code>. Or, use <code>[all]</code> to include them all.</p> <p>For more about installing DVC and its dependencies, <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">check out our docs</a>.</p> <h3 id="q-how-do-i-keep-track-of-changes-in-modules-that-my-dvc-pipeline-depends-on-for-example-i-have-a-pipeline-stage-that-runs-a-script-preparepy-which-imports-a-module-modulepy-if-modulepy-changes-how-will-dvc-know-to-rerun-the-pipeline-stage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/663952575984435220" target="_blank" rel="nofollow noopener noreferrer">Q: How do I keep track of changes in <em>modules</em> that my DVC pipeline depends on? For example, I have a pipeline stage that runs a script <code>prepare.py</code>, which imports a module <code>module.py</code>. 
If <code>module.py</code> changes, how will DVC know to rerun the pipeline stage?</a><a href="#q-how-do-i-keep-track-of-changes-in-modules-that-my-dvc-pipeline-depends-on-for-example-i-have-a-pipeline-stage-that-runs-a-script-preparepy-which-imports-a-module-modulepy-if-modulepy-changes-how-will-dvc-know-to-rerun-the-pipeline-stage" aria-label="q how do i keep track of changes in modules that my dvc pipeline depends on for example i have a pipeline stage that runs a script preparepy which imports a module modulepy if modulepy changes how will dvc know to rerun the pipeline stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If your DVC pipeline only lists <code>prepare.py</code> as a dependency, then changing code in module files won't trigger a re-run of the pipeline. Meaning that if you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> after updating <code>module.py</code>, DVC will simply return the result of your last pipeline run and a message that nothing has changed.</p> <p>To explain further why this happens:</p> <p>DVC is platform agnostic and it doesn't know whether your command's executable is <code>python</code>, some other script interpreter, or a compiled binary for that matter.</p> <blockquote> <p>E.g. this is a valid stage: <code>dvc run -o hello.txt 'echo "Hello!" > hello.txt'</code> (where the executable is echo).</p> </blockquote> <p>DVC also doesn't know what's going on inside the command's source code. 
Therefore, any file that your code requires internally should be explicitly specified as a pipeline stage dependency (on the CLI with <code>dvc run -d</code>, or in YAML under <code>deps:</code>) for DVC to track it.</p> <p>If you're not interested in adding modules as explicit dependencies, there are a few other approaches:</p> <ul> <li>Make your <code>requirements.txt</code> file a stage dependency (if the loaded module comes from a package).</li> <li>Manually rebuild the pipeline (with <a href="https://dvc.org/doc/command-reference/repro#--force"><code>dvc repro --force &lt;stage&gt;.dvc</code></a>) when you know an unmarked dependency has changed, although this is prone to human error.</li> <li>Have a version/build number comment in the main script that always gets updated when an unmarked dependency changes; this could be automated.</li> </ul> <p><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/658501655641325580" target="_blank" rel="nofollow noopener noreferrer">See here for more information on similar use cases.</a></p> <p>We also have an ongoing discussion about this issue on our GitHub repository, and we'd love your input. <a href="https://github.com/iterative/dvc/issues/1577#issuecomment-568391709" target="_blank" rel="nofollow noopener noreferrer">Please join the discussion in this issue if you can!</a></p> <h3 id="q-my-dvc-pipeline-has-a-lot-of-dependencies-and-i-dont-want-to-manually-write-them-all-out-in-my-dvcyaml-file-are-there-any-ways-to-use-wildcards-like--or-specify-directories-as-dependencies" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/803961071135883294" target="_blank" rel="nofollow noopener noreferrer">Q: My DVC pipeline has <em>a lot</em> of dependencies, and I don't want to manually write them all out in my <code>dvc.yaml</code> file. 
Are there any ways to use wildcards (like <code>*</code>) or specify directories as dependencies?</a><a href="#q-my-dvc-pipeline-has-a-lot-of-dependencies-and-i-dont-want-to-manually-write-them-all-out-in-my-dvcyaml-file-are-there-any-ways-to-use-wildcards-like--or-specify-directories-as-dependencies" aria-label="q my dvc pipeline has a lot of dependencies and i dont want to manually write them all out in my dvcyaml file are there any ways to use wildcards like or specify directories as dependencies permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, you can set a directory to be a dependency or an output of a DVC pipeline stage. 
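</p> <p>As a minimal sketch of what this can look like in <code>dvc.yaml</code> (the stage name, script, and directory paths below are hypothetical and only for illustration):</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml">stages:
  prepare:
    cmd: python prepare.py data/raw data/prepared
    deps:
      - prepare.py
      # A directory dependency: DVC tracks every file under it
      - data/raw
    outs:
      # Outputs can be directories, too
      - data/prepared</code></pre></div> <p>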
This means you can have tens, hundreds, thousands or millions of dependency files in one directory, and all you have to declare in the pipeline is the path of that directory.</p> <p><a href="https://dvc.org/doc/command-reference/run#options" target="_blank" rel="nofollow noopener noreferrer">Check out all the options here.</a></p> <h2 id="cml-questions" style="position:relative;">CML Questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-i-heard-theres-a-new-cml-feature-using-terraform-to-provision-runners-when-is-this-coming-out" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/812069229473562624" target="_blank" rel="nofollow noopener noreferrer">Q: I heard there's a new CML feature using Terraform to provision runners. 
When is this coming out?</a><a href="#q-i-heard-theres-a-new-cml-feature-using-terraform-to-provision-runners-when-is-this-coming-out" aria-label="q i heard theres a new cml feature using terraform to provision runners when is this coming out permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You're in luck, because we just shared this feature as part of the CML 0.3.0 pre-release! The pre-release introduced a new function, <code>cml runner</code>, which upgraded our <a href="https://github.com/iterative/cml_cloud_case/blob/b76aba13791ce18c5715f464f58877ffa10d4cfa/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">previous method for launching instances in the cloud from a CI workflow using Docker Machine</a>. In the new <code>cml runner</code> function built on Terraform, you can deploy instances in AWS and Azure with a single command (it used to take about 30 lines of code!). 
For example, to launch a <code>t2.micro</code> instance on AWS from your GitHub Actions or GitLab CI workflow, you'll run:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">cml runner <span class="token punctuation">\</span>
  <span class="token parameter variable">--cloud</span> aws <span class="token punctuation">\</span>
  --cloud-region us-west <span class="token punctuation">\</span>
  --cloud-type<span class="token operator">=</span>t2.micro <span class="token punctuation">\</span>
  <span class="token parameter variable">--labels</span><span class="token operator">=</span>cml-runner</code></pre></div> <p>Check out the <a href="https://dvc.org/blog/cml-runner-prerelease" target="_blank" rel="nofollow noopener noreferrer">pre-release notes</a> and our <a href="https://github.com/iterative/cml-runner-base-case" target="_blank" rel="nofollow noopener noreferrer">example project repository</a> to get started.</p> <h3 id="q-my-ci-workflow-creates-a-reportmdhttpreportmd-document-that-gets-published-to-my-pull-request-by-cml-i-want-to-save-the-reportmd-file-to-my-repository-too-is-this-possible" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/810946119374340127" target="_blank" rel="nofollow noopener noreferrer">Q: My CI workflow creates a <code>report.md</code> document that gets published to my pull request by CML. I want to save the <code>report.md</code> file to my repository, too. 
Is this possible?</a><a href="#q-my-ci-workflow-creates-a-reportmdhttpreportmd-document-that-gets-published-to-my-pull-request-by-cml-i-want-to-save-the-reportmd-file-to-my-repository-too-is-this-possible" aria-label="q my ci workflow creates a reportmdhttpreportmd document that gets published to my pull request by cml i want to save the reportmd file to my repository too is this possible permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>By default, files that are created in a GitHub Actions or GitLab CI workflow only exist on the runner: as soon as the runner turns off, they vanish. Functions like <code>cml publish</code> and <code>cml send-comment</code> create persistent links to data visualizations, tables, and other outputs of your workflow so you can view them long after your run ends. However, by design, CML doesn't commit files to your repository (not all users want this!).</p> <p>What you're likely looking for is an auto-commit step that essentially runs <code>git add</code> and <code>git commit</code> on the files generated by the workflow, so that they end up in your repository. 
You can manually write this code into your workflow file, or you can use a GitHub Action tool like the <a href="https://github.com/marketplace/actions/git-auto-commit" target="_blank" rel="nofollow noopener noreferrer">Auto Commit</a> or <a href="https://github.com/marketplace/actions/add-commit" target="_blank" rel="nofollow noopener noreferrer">Add & Commit</a> Actions.</p> <h3 id="q-do-you-have-any-suggested-caching-strategies-with-cml-and-dvc-my-dvc-pipeline-runs-in-a-ci-workflow-and-it-depends-on-15-gb-of-data-i-dont-want-to-download-this-dataset-to-my-runner-every-time-the-workflow-runs" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/812059539696386079" target="_blank" rel="nofollow noopener noreferrer">Q: Do you have any suggested caching strategies with CML and DVC? My DVC pipeline runs in a CI workflow, and it depends on ~15 GB of data. I don't want to download this dataset to my runner every time the workflow runs.</a><a href="#q-do-you-have-any-suggested-caching-strategies-with-cml-and-dvc-my-dvc-pipeline-runs-in-a-ci-workflow-and-it-depends-on-15-gb-of-data-i-dont-want-to-download-this-dataset-to-my-runner-every-time-the-workflow-runs" aria-label="q do you have any suggested caching strategies with cml and dvc my dvc pipeline runs in a ci workflow and it depends on 15 gb of data i dont want to download this dataset to my runner every time the workflow runs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Downloading data to a runner on every CI workflow can be needlessly time 
consuming, particularly when the data rarely changes.</p> <p>While we don't have a CML-specific mechanism in the works for this use case, there are two main approaches we see as viable:</p> <ol> <li><strong>Attach an EBS volume</strong> to the instance that runs your workflow. If you're using DVC, DVC needs to run in that volume (at the very least, your DVC cache must be there). A user <a href="https://discord.com/channels/485586884165107732/728693131557732403/812059539696386079" target="_blank" rel="nofollow noopener noreferrer">recently let us know</a> that this approach is working well for them and prevents unnecessary re-downloads of their DVC cache. They also <a href="https://towardsdatascience.com/stop-duplicating-deep-learning-training-datasets-with-amazon-ebs-multi-attach-d9f61fdc1de4" target="_blank" rel="nofollow noopener noreferrer">recommended this article</a> for setup guidelines.</li> <li><strong>Use a shared DVC cache.</strong> Currently, many DVC users configure their cache in shared <a href="https://en.wikipedia.org/wiki/Network_File_System" target="_blank" rel="nofollow noopener noreferrer">NFS</a>. A similar setup that might help here is using a single shared development server- <a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">check out our docs for a use case</a>.</li> </ol> <hr> <p>As always, if you have any use case questions or need support, join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>! 
Or head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</p> <p>And, you can follow us on <a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and <a href="https://www.linkedin.com/company/iterative-ai" target="_blank" rel="nofollow noopener noreferrer">LinkedIn</a>!</p>https://dvc.org/blog/cml-runner-prereleasehttps://dvc.org/blog/cml-runner-prereleaseMon, 22 Feb 2021 00:00:00 GMT<p>Today, we're pre-releasing some new features in Continuous Machine Learning, or <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a>—our open source project to adapt popular continuous integration (CI) systems like GitHub Actions and GitLab CI for data science. CML has become a popular tool for auto-generating ML model reports right in a GitHub Pull Request and orchestrating resources for training models in the cloud.</p> <p>Here's what's in today's pre-release:</p> <h2 id="brand-new-method-to-provision-cloud-compute-for-your-ci-workflows" style="position:relative;">Brand new method to provision cloud compute for your CI workflows<a href="#brand-new-method-to-provision-cloud-compute-for-your-ci-workflows" aria-label="brand new method to provision cloud compute for your ci workflows permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>After the initial CML release, we found ways to significantly simplify the process of allocating resources in CI/CD. 
We developed a brand new CML command <code>cml runner</code> that hides much of the complexity of configuring and provisioning an instance, keeping your workflows free of <code>bash</code> scripting clutter (until the official release, docs are <a href="https://github.com/iterative/cml/blob/c2b96c461011f01ab2476e1542fb89d7229d150d/README.md" target="_blank" rel="nofollow noopener noreferrer">in development here</a>). The new approach uses Terraform provider under the hood instead of Docker Machine, as in the first version.</p> <p>Check out this example workflow to launch an EC2 instance from a GitHub Action workflow and then train a model. We hope you'll agree it's shorter, sweeter, and more powerful than ever!</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train in the cloud'</span> <span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span> <span class="token key atrule">jobs</span><span class="token punctuation">:</span> <span class="token key atrule">deploy-runner</span><span class="token punctuation">:</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>ubuntu<span class="token punctuation">-</span>latest<span class="token punctuation">]</span> <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span 
class="token key atrule">name</span><span class="token punctuation">:</span> deploy <span class="token key atrule">shell</span><span class="token punctuation">:</span> bash <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">AWS_ACCESS_KEY_ID</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_ACCESS_KEY_ID <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">AWS_SECRET_ACCESS_KEY</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.AWS_SECRET_ACCESS_KEY <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> cml runner \ --cloud aws \ --cloud-region us-west \ --cloud-type=t2.micro \ --labels=cml-runner</span> <span class="token key atrule">train-model</span><span class="token punctuation">:</span> <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span> <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span 
class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2 <span class="token key atrule">with</span><span class="token punctuation">:</span> <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span> <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train my model'</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> pip install -r requirements.txt python train.py</span></code></pre></div> <p>If you use CML functions in the <code>train-model</code> step, you can go even further and get a closed loop—sending model training results from the EC2 instance to your pull request or merge request! 
For example, if we expand the <code>train-model</code> step to incorporate functions like <code>cml publish</code> and <code>cml send-comment</code>:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">train-model</span><span class="token punctuation">:</span> <span class="token key atrule">needs</span><span class="token punctuation">:</span> deploy<span class="token punctuation">-</span>runner <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">-</span>runner<span class="token punctuation">]</span> <span class="token key atrule">container</span><span class="token punctuation">:</span> docker<span class="token punctuation">:</span>//dvcorg/cml <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/setup<span class="token punctuation">-</span>python@v2 <span class="token key atrule">with</span><span class="token punctuation">:</span> <span class="token key atrule">python-version</span><span class="token punctuation">:</span> <span class="token string">'3.x'</span> <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> <span class="token string">'Train a model'</span> <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> 
secrets.PERSONAL_ACCESS_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> pip install -r requirements.txt python train.py</span> echo "<span class="token comment">## Report from your EC2 Instance" > report.md</span> cat metrics.txt <span class="token punctuation">></span><span class="token punctuation">></span> report.md cml publish "plot.png" <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md cml send<span class="token punctuation">-</span>comment report.md</code></pre></div> <p>You'll get a pull request that looks something like this:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c06746a683bc64bdcbde8464ca728656/39600/sample_pr.png" alt="sample pr" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>All the code to replicate this example is up on a <a href="https://github.com/iterative/cml-runner-base-case" target="_blank" rel="nofollow noopener noreferrer">brand new demo repository</a>.</p> <h3 id="our-favorite-details" style="position:relative;">Our favorite details<a href="#our-favorite-details" aria-label="our favorite details permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 
3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The new <code>cml runner</code> function lets you turn on instances, including GPU, high-memory and spot instances, and kick off a new workflow using the hardware and environment of your choice—and of course, it'll turn <em>off</em> those instances after a configurable timeout! In the first CML release, this took <a href="https://github.com/iterative/cml_cloud_case/blob/master/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">more than 30 lines of code</a> to configure. Now it's just one function.</p> <p>Another highlight: you can use whatever Docker container you'd like on your instance. In the above example, we use our <a href="https://github.com/iterative/cml/blob/master/Dockerfile" target="_blank" rel="nofollow noopener noreferrer">custom CML Docker container</a> (because we like it!)—but you certainly don't have to! Whatever image you choose, we highly recommend containerizing your environment for ultimate reproducibility and security with CML.</p> <p>You can also use the new <code>cml runner</code> function to set up a <a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">local self-hosted runner</a>. 
On your local machine or on-premises GPU cluster, you'll install CML as a package and then run:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ cml runner <span class="token punctuation">\</span>
    <span class="token parameter variable">--repo</span> <span class="token variable">$your_project_repository_url</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">--token</span><span class="token operator">=</span><span class="token variable">$personal_access_token</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">--labels</span> tf <span class="token punctuation">\</span>
    --idle-timeout <span class="token number">180</span></code></pre></div> <p>Now your machine will be listening for workflows from your project repository.</p> <h2 id="a-new-github-action" style="position:relative;">A New GitHub Action<a href="#a-new-github-action" aria-label="a new github action permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>One more thing: you might've noticed in our example workflow above that there's a <a href="https://github.com/iterative/setup-cml" target="_blank" rel="nofollow noopener noreferrer">new CML GitHub Action</a>! The new Action helps you set up CML, giving you one more way to mix and match the CML suite of functions with your preferred environment.</p> <p>The new Action is designed to be a straightforward, all-in-one install that gives you immediate use of functions like <code>cml publish</code> and <code>cml runner</code>. 
You'll add this step to your workflow:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> iterative/setup<span class="token punctuation">-</span>cml@v1</code></pre></div> <p><a href="https://github.com/iterative/setup-cml" target="_blank" rel="nofollow noopener noreferrer">More details are in the docs!</a></p> <h2 id="get-ready-for-the-release" style="position:relative;">Get ready for the release<a href="#get-ready-for-the-release" aria-label="get ready for the release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We're inviting our community members to explore these new features in anticipation of our upcoming, <em>official</em> release. As always, feedback is welcome by opening an issue on the <a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">CML GitHub repository</a>, as a comment here or via our <a href="https://discord.gg/bzA6uY7" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>. 
We're excited to hear what you think!</p>https://dvc.org/blog/dvc-2-0-pre-releasehttps://dvc.org/blog/dvc-2-0-pre-releaseWed, 17 Feb 2021 00:00:00 GMT<h2 id="install" style="position:relative;">Install<a href="#install" aria-label="install permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>First things first. You can install the 2.0 pre-release from the master branch in our repo (instruction <a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">here</a>) or through pip:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">--upgrade</span> <span class="token parameter variable">--pre</span> dvc</span></code></pre></div> <h2 id="ml-pipelines-parameterization-and-foreach-stages" style="position:relative;">ML pipelines parameterization and foreach stages<a href="#ml-pipelines-parameterization-and-foreach-stages" aria-label="ml pipelines parameterization and foreach stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 
3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>After we introduced the multi-stage pipeline file <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, it was quickly adopted by our users. The DVC team got tons of positive feedback from them, as well as feature requests.</p> <h3 id="pipeline-parameters-from-vars" style="position:relative;">Pipeline parameters from <code>vars</code><a href="#pipeline-parameters-from-vars" aria-label="pipeline parameters from vars permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The most requested feature was the ability to use parameters in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. For example, you can pass the same seed value or filename to multiple stages in the pipeline.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">vars</span><span class="token punctuation">:</span>
  <span class="token key atrule">train_matrix</span><span class="token punctuation">:</span> train.pkl
  <span class="token key atrule">test_matrix</span><span class="token punctuation">:</span> test.pkl
  <span class="token key atrule">seed</span><span class="token punctuation">:</span> <span class="token number">20210215</span>

<span class="token punctuation">...</span>

<span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">process</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python process.py \
      <span class="token punctuation">-</span><span class="token punctuation">-</span>seed $<span class="token punctuation">{</span>seed<span class="token punctuation">}</span> \
      <span class="token punctuation">-</span><span class="token punctuation">-</span>train $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> \
      <span class="token punctuation">-</span><span class="token punctuation">-</span>test $<span class="token punctuation">{</span>test_matrix<span class="token punctuation">}</span>
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>test_matrix<span class="token punctuation">}</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span>

  <span class="token punctuation">...</span>

  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span> <span class="token punctuation">-</span><span class="token punctuation">-</span>seed $<span class="token punctuation">{</span>seed<span class="token punctuation">}</span>
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>train_matrix<span class="token punctuation">}</span></code></pre></div> <p>It also lets you keep all the important parameters in a single <code>vars</code> block and experiment with them. This is a natural fit for scenarios like NLP, or when hyperparameter optimization happens not only in the model training code but in the data processing as well.</p> <h3 id="pipeline-parameters-from-params-files" style="position:relative;">Pipeline parameters from params files<a href="#pipeline-parameters-from-params-files" aria-label="pipeline parameters from params files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>It is quite common to define pipeline parameters in a config file or a parameters file (like <code>params.yaml</code>) instead of in the pipeline file <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> itself.
Parameters defined in <code>params.yaml</code> can also be used in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># params.yaml</span>
<span class="token key atrule">models</span><span class="token punctuation">:</span>
  <span class="token key atrule">us</span><span class="token punctuation">:</span>
    <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span>
    <span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-us.hdf5'</span></code></pre></div> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># dvc.yaml</span>
<span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">build-us</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token punctuation">-</span>
      python script.py
        <span class="token punctuation">-</span><span class="token punctuation">-</span>out $<span class="token punctuation">{</span>models.us.filename<span class="token punctuation">}</span>
        <span class="token punctuation">-</span><span class="token punctuation">-</span>thresh $<span class="token punctuation">{</span>models.us.thresh<span class="token punctuation">}</span>
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> $<span class="token punctuation">{</span>models.us.filename<span class="token punctuation">}</span></code></pre></div> <p>DVC has properly tracked params dependencies for each stage since the previous major version, DVC 1.0.
See the <a href="https://dvc.org/doc/command-reference/run#for-displaying-and-comparing-data-science-experiments" target="_blank" rel="nofollow noopener noreferrer"><code>--params</code> option</a> of <code>dvc run</code> for more details.</p> <h3 id="iterating-over-params-with-foreach-stages" style="position:relative;">Iterating over params with foreach stages<a href="#iterating-over-params-with-foreach-stages" aria-label="iterating over params with foreach stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Iterating over params was a frequently requested feature. Now users can define multiple similar stages with a templatized command.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">build</span><span class="token punctuation">:</span>
    <span class="token key atrule">foreach</span><span class="token punctuation">:</span>
      <span class="token key atrule">gb</span><span class="token punctuation">:</span>
        <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">15</span>
        <span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-gb.hdf5'</span>
      <span class="token key atrule">us</span><span class="token punctuation">:</span>
        <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">10</span>
        <span class="token key atrule">filename</span><span class="token punctuation">:</span> <span class="token string">'model-us.hdf5'</span>
    <span class="token key atrule">do</span><span class="token punctuation">:</span>
      <span class="token key atrule">cmd</span><span class="token punctuation">:</span> <span class="token punctuation">></span><span class="token punctuation">-</span>
        python script.py
          <span class="token punctuation">-</span><span class="token punctuation">-</span>out $<span class="token punctuation">{</span>item.filename<span class="token punctuation">}</span>
          <span class="token punctuation">-</span><span class="token punctuation">-</span>thresh $<span class="token punctuation">{</span>item.thresh<span class="token punctuation">}</span>
      <span class="token key atrule">outs</span><span class="token punctuation">:</span>
        <span class="token punctuation">-</span> $<span class="token punctuation">{</span>item.filename<span class="token punctuation">}</span></code></pre></div> <h2 id="lightweight-ml-experiments" style="position:relative;">Lightweight ML experiments<a href="#lightweight-ml-experiments" aria-label="lightweight ml experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>DVC uses Git versioning as the basis for ML experiments. This solid foundation makes each experiment reproducible and accessible from the project's history.
This Git-based approach works very well for ML projects with mature models when only a few new experiments per day are run.</p> <p>However, in more active development, when dozens or hundreds of experiments need to be run in a single day, Git creates overhead: each experiment run requires additional Git commands (<code>git add/commit</code>), and comparing all experiments is difficult.</p> <p>DVC 2.0 introduces lightweight experiments! They let you auto-track ML experiments without any extra overhead for ML engineers.</p> <p>⚠️ Note, our new ML experiment features (<a href="https://dvc.org/doc/command-reference/exp"><code>dvc exp</code></a>) are experimental in the coming release. This means that the commands might change a bit in the following minor releases.</p> <p><a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> can run an ML experiment with a new hyperparameter from <code>params.yaml</code>, while <a href="https://dvc.org/doc/command-reference/exp/diff"><code>dvc exp diff</code></a> shows the difference in metrics and params:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">3000</span>
</span>Reproduced experiment(s): exp-bb55c
Experiment results have been applied to your workspace.

<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp diff</span>
</span>Path         Metric    Value    Change
scores.json  auc       0.57462  0.0072197

Path         Param                   Value    Change
params.yaml  featurize.max_features  3000     1500</code></pre></div> <p>More experiments:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">4000</span>
</span>Reproduced experiment(s): exp-9bf22
Experiment results have been applied to your workspace.

<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">5000</span>
</span>Reproduced experiment(s): exp-63ee0
Experiment results have been applied to your workspace.

<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span> <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.max_features</span><span class="token operator">=</span><span class="token number">5000</span> <span class="token punctuation">\</span>
              <span class="token parameter variable">--set-param</span> <span class="token assign-left variable">featurize.ngrams</span><span class="token operator">=</span><span class="token number">3</span>
</span>Reproduced experiment(s): exp-80655
Experiment results have been applied to your workspace.</code></pre></div> <p>In the examples above, hyperparameters were changed with the <code>--set-param</code> option, but you can make these changes by modifying the params file instead. In fact, <em>any code or data files can be changed</em> and <a href="https://dvc.org/doc/command-reference/exp/run"><code>dvc exp run</code></a> will capture the variations.</p> <p>See all the runs:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span> <span class="token punctuation">\</span>
               <span class="token parameter variable">--include-params</span> featurize.max_features,featurize.ngrams</span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ─────────────────────────────────────────────────────────────────────
<span class="token rows">  <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span>      <span class="token bold"><span class="token hide">**</span>auc<span class="token hide">**</span></span>       <span class="token bold"><span class="token hide">**</span>featurize.max_features<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>featurize.ngrams<span class="token hide">**</span></span>
</span> ─────────────────────────────────────────────────────────────────────
<span class="token rows">  workspace       0.56359   5000                     3
  master          0.5674    1500                     2
  ├── exp-80655   0.56359   5000                     3
  ├── exp-63ee0   0.5515    5000                     2
  ├── exp-9bf22   0.56448   4000                     2
  └── exp-bb55c   0.57462   3000                     2
</span> ─────────────────────────────────────────────────────────────────────</code></pre></div> <p>Under the hood, DVC uses Git to store the experiments' meta-information. A straightforward implementation would create visible branches and auto-commit in them, but that approach would over-pollute the branch namespace very quickly. To avoid this issue, we introduced custom Git references <code>exps</code>, in the same way that GitHub uses custom references <code>pulls</code> to track pull requests (this is an interesting technical topic that deserves a separate blog post).
Below you can see how it works.</p> <p>No artificial branches, only custom references <code>exps</code> (do not worry if you don't understand this part - it is an implementation detail):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> branch
</span>* master

<span class="token line"><span class="token input">$ </span><span class="token command">git</span> show-ref
</span>5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22</code></pre></div> <p>The best experiment can be promoted to the workspace and committed to Git.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp apply</span> exp-bb55c
</span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">'optimize max feature size'</span></span></code></pre></div> <p>Alternatively, an experiment can be promoted to a branch (<code>big_fr_size</code> branch in this case):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp branch</span> exp-80655 big_fr_size
</span>Git branch 'big_fr_size' has been created from experiment 'exp-c695f'.
To switch to the new branch run:

        git checkout big_fr_size</code></pre></div> <p>Remove all the experiments that were not used:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp gc</span> <span class="token parameter variable">--workspace</span> <span class="token parameter variable">--force</span></span></code></pre></div> <h2 id="model-checkpoints" style="position:relative;">Model checkpoints<a href="#model-checkpoints" aria-label="model checkpoints permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>ML model checkpoints are an essential part of deep learning.
ML engineers prefer to save the model files (or weights) at checkpoints during training and return to them when metrics start diverging or learning is not fast enough.</p> <p>Checkpoints create a different dynamic around the ML modeling process and need special support from the toolset:</p> <ol> <li>Track and save model checkpoints (DVC outputs) periodically, not only the final result or training epoch.</li> <li>Save metrics corresponding to each of the checkpoints.</li> <li>Reuse checkpoints - warm-start training with an existing model file, corresponding code, dataset version and metrics.</li> </ol> <p>This new behavior is supported in DVC 2.0. Now, DVC can version all your checkpoints with corresponding code and data. It brings the reproducibility of DL processes to the next level - every checkpoint is reproducible.</p> <p>This is how you define checkpoints with live-metrics:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc stage add</span> <span class="token parameter variable">-n</span> train <span class="token punctuation">\</span>
        <span class="token parameter variable">-d</span> users.csv <span class="token parameter variable">-d</span> train.py <span class="token punctuation">\</span>
        <span class="token parameter variable">-p</span> dropout,epochs,lr,process <span class="token punctuation">\</span>
        <span class="token parameter variable">--checkpoint</span> model.h5 <span class="token punctuation">\</span>
        <span class="token parameter variable">--live</span> logs <span class="token punctuation">\</span>
    python train.py
</span>
Creating 'dvc.yaml'
Adding stage 'train' in 'dvc.yaml'</code></pre></div> <p>Note that we use the <a href="https://dvc.org/doc/command-reference/stage/add"><code>dvc stage add</code></a> command instead of <code>dvc run</code>.
Starting with DVC 2.0, we are moving all stage-specific functionality under the <a href="https://dvc.org/doc/command-reference/stage"><code>dvc stage</code></a> umbrella. <code>dvc run</code> still works, but will be deprecated in the next major DVC version (most likely 3.0).</p> <p>Start the training process and interrupt it after 5 epochs:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>'users.csv.dvc' didn't change, skipping
Running stage 'train':
> python train.py
...
^CTraceback (most recent call last):
...
KeyboardInterrupt</code></pre></div> <p>Navigate the checkpoints:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────
<span class="token rows">  <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span>      <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span>     <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────
<span class="token rows">  workspace       4      2.0702   0.30388    2.025      …   5        …
  master          -      -        -          -          …   5        …
  │ ╓ exp-e15bc   4      2.0702   0.30388    2.025      …   5        …
  │ ╟ 5ea8327     4      2.0702   0.30388    2.025      …   5        …
  │ ╟ bc0cf02     3      2.1338   0.23988    2.0883     …   5        …
  │ ╟ f8cf03f     2      2.1989   0.17932    2.1542     …   5        …
  │ ╟ 7575a44     1      2.2694   0.12833    2.223      …   5        …
  ├─╨ a72c526     0      2.3416   0.0959     2.2955     …   5        …
</span> ──────────────────────────────────────────────────────────────────────</code></pre></div> <p>Each of the checkpoints above is a separate experiment with all data, code, parameters and metrics. You can use the same <a href="https://dvc.org/doc/command-reference/exp/apply"><code>dvc exp apply</code></a> command to extract any of these.</p> <p>Another run continues this process. You can see the accuracy metric increasing: DVC does not remove the model checkpoint, and the training code continues training on top of it:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Existing checkpoint experiment 'exp-e15bc' will be resumed
...
^C
KeyboardInterrupt

<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────
<span class="token rows">  <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span>      <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span>     <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────
<span class="token rows">  workspace       9      1.7845   0.58125    1.7381     …   5        …
  master          -      -        -          -          …   5        …
  │ ╓ exp-e15bc   9      1.7845   0.58125    1.7381     …   5        …
  │ ╟ 205a8d3     9      1.7845   0.58125    1.7381     …   5        …
  │ ╟ dd23d96     8      1.8369   0.54173    1.7919     …   5        …
  │ ╟ 5bb3a1f     7      1.8929   0.49108    1.8474     …   5        …
  │ ╟ 6dc5610     6      1.951    0.43433    1.9046     …   5        …
  │ ╟ a79cf29     5      2.0088   0.36837    1.9637     …   5        …
  │ ╟ 5ea8327     4      2.0702   0.30388    2.025      …   5        …
  │ ╟ bc0cf02     3      2.1338   0.23988    2.0883     …   5        …
  │ ╟ f8cf03f     2      2.1989   0.17932    2.1542     …   5        …
  │ ╟ 7575a44     1      2.2694   0.12833    2.223      …   5        …
  ├─╨ a72c526     0      2.3416   0.0959     2.2955     …   5        …
</span> ──────────────────────────────────────────────────────────────────────</code></pre></div> <p>After modifying the code, data, or params, the same process can be resumed. DVC recognizes the change and shows it (see experiment <code>b363267</code>):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> train.py <span class="token comment"># modify code</span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> params.yaml <span class="token comment"># modify params</span>
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp run</span>
</span>Modified checkpoint experiment based on 'exp-e15bc' will be created
...

<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc exp show</span> <span class="token parameter variable">--no-pager</span> <span class="token parameter variable">--no-timestamp</span></span></code></pre></div> <div class="gatsby-highlight" data-language="dvctable"><pre class="language-dvctable"><code class="language-dvctable"> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows">  <span class="token bold"><span class="token hide">**</span>Experiment<span class="token hide">**</span></span>              <span class="token bold"><span class="token hide">**</span>step<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>loss<span class="token hide">**</span></span>     <span class="token bold"><span class="token hide">**</span>accuracy<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>val_loss<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>epochs<span class="token hide">**</span></span>   <span class="token bold"><span class="token hide">**</span>…<span class="token hide">**</span></span>
</span> ──────────────────────────────────────────────────────────────────────────────
<span class="token rows">  workspace               13     1.5841   0.69262    1.5381     …   15       …
  master                  -      -        -          -          …   5        …
  │ ╓ exp-7ff06           13     1.5841   0.69262    1.5381     …   15       …
  │ ╟ 6c62fec             12     1.6325   0.67248    1.5857     …   15       …
  │ ╟ 4baca3c             11     1.6817   0.64855    1.6349     …   15       …
  │ ╟ b363267 (2b06de7)   10     1.7323   0.61925    1.6857     …   15       …
  │ ╓ 2b06de7             9      1.7845   0.58125    1.7381     …   5        …
  │ ╟ 205a8d3             9      1.7845   0.58125    1.7381     …   5        …
  │ ╟ dd23d96             8      1.8369   0.54173    1.7919     …   5        …
  │ ╟ 5bb3a1f             7      1.8929   0.49108    1.8474     …   5        …
  │ ╟ 6dc5610             6      1.951    0.43433    1.9046     …   5        …
  │ ╟ a79cf29             5      2.0088   0.36837    1.9637     …   5        …
  │ ╟ 5ea8327             4      2.0702   0.30388    2.025      …   5        …
  │ ╟ bc0cf02             3      2.1338   0.23988    2.0883     …   5        …
  │ ╟ f8cf03f             2      2.1989   0.17932    2.1542     …   5        …
  │ ╟ 7575a44             1      2.2694   0.12833    2.223      …   5        …
  ├─╨ a72c526             0      2.3416   0.0959     2.2955     …   5        …
</span> ──────────────────────────────────────────────────────────────────────────────</code></pre></div> <p>Sometimes you might need to train the model from scratch.
The reset option removes the checkpoint file before training: <a href="https://dvc.org/doc/command-reference/exp/run#--reset"><code>dvc exp run --reset</code></a>.</p> <h2 id="metrics-logging" style="position:relative;">Metrics logging<a href="#metrics-logging" aria-label="metrics logging permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Continuously logging ML metrics is a very common practice in the ML world. Instead of simple command-line output with the metrics values, many ML engineers prefer visuals and plots. These plots can be organized in a "database" of ML experiments to keep track of a project. There are many specialized solutions for metrics collection and experiment tracking, such as Sacred, MLflow, Weights &amp; Biases, neptune.ai, and others.</p> <p>With DVC 2.0, we are releasing a new open-source library, <a href="https://github.com/iterative/dvclive" target="_blank" rel="nofollow noopener noreferrer">DVC-Live</a>, that tracks model metrics and organizes them in simple text files, so that DVC can visualize the metrics while navigating Git history.
So, DVC can show you a metrics difference between the current model and a model in <code>master</code> or any other branch.</p> <p>This approach is similar to the other metrics tracking tools with the difference that Git becomes a "database" or of ML experiments.</p> <h3 id="generate-metrics-file" style="position:relative;">Generate metrics file<a href="#generate-metrics-file" aria-label="generate metrics file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Install the library:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> dvclive</span></code></pre></div> <p>Instrument your code:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvclive <span class="token keyword">from</span> dvclive<span class="token punctuation">.</span>keras <span class="token keyword">import</span> DvcLiveCallback dvclive<span class="token punctuation">.</span>init<span class="token punctuation">(</span><span class="token string">"logs"</span><span class="token punctuation">)</span> <span class="token comment">#, summarize=True)</span> <span class="token punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span> model<span class="token punctuation">.</span>fit<span class="token punctuation">(</span><span class="token 
punctuation">.</span><span class="token punctuation">.</span><span class="token punctuation">.</span>
    <span class="token comment"># Set up DVC-Live callback:</span>
    callbacks<span class="token operator">=</span><span class="token punctuation">[</span> DvcLiveCallback<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">]</span>
<span class="token punctuation">)</span>
</code></pre></div> <p>During training, you will see the metrics files being populated after each epoch:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">ls</span> logs/
</span>accuracy.tsv loss.tsv val_accuracy.tsv val_loss.tsv
<span class="token line"><span class="token input">$ </span><span class="token command">head</span> logs/accuracy.tsv
</span>timestamp      step  accuracy
1613645582716  0     0.7360000014305115
1613645585478  1     0.8349999785423279
1613645587322  2     0.8830000162124634
1613645589125  3     0.9049999713897705
1613645590891  4     0.9070000052452087
1613645592681  5     0.9279999732971191
1613645594490  6     0.9430000185966492
1613645596232  7     0.9369999766349792
1613645598034  8     0.9430000185966492</code></pre></div> <p>In addition to the continuous metrics files, you will see a summary metrics file and an HTML file with the same file prefix.
The summary file contains the result of the latest epoch:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> logs.json <span class="token operator">|</span> python <span class="token parameter variable">-m</span> json.tool
</span>{
    "step": 41,
    "loss": 0.015958430245518684,
    "accuracy": 0.9950000047683716,
    "val_loss": 13.705962181091309,
    "val_accuracy": 0.5149999856948853
}</code></pre></div> <p>The HTML file contains all the visuals for continuous metrics as well as the summary metrics on a single page:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b66f0f1e2076cdf2661acb4f621e7255/39600/dvclive-html.png" alt="dvclive html" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Note that the HTML and summary metrics files are regenerated automatically after each epoch, so you can monitor model performance in real time.</p> <h3 id="git-navigation-with-the-metrics-file" style="position:relative;">Git-navigation with the metrics file<a href="#git-navigation-with-the-metrics-file" aria-label="git navigation with the metrics file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>A DVC repository is NOT required to use the live metrics functionality described above.
It works independently from DVC.</p> <p>A DVC repository becomes useful when the metrics and plots are committed to your Git repository and you need to navigate through the metrics history.</p> <p>Metrics difference between workspace and the last Git commit:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git status</span> <span class="token parameter variable">-s</span>
</span> M logs.json
 M logs/accuracy.tsv
 M logs/loss.tsv
 M logs/val_accuracy.tsv
 M logs/val_loss.tsv
 M train.py
?? model.h5
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> <span class="token parameter variable">--target</span> logs.json
</span>Path       Metric        Old       New      Change
logs.json  accuracy      0.995     0.99     -0.005
logs.json  loss          0.01596   0.03036  0.0144
logs.json  step          41        36       -5
logs.json  val_accuracy  0.515     0.5175   0.0025
logs.json  val_loss      13.70596  3.29033  -10.41563</code></pre></div> <p>The difference against a particular commit, branch, or tag, or between two commits:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics diff</span> <span class="token parameter variable">--target</span> logs.json HEAD^ 47b85c
</span>Path       Metric        Old       New      Change
logs.json  accuracy      0.995     0.998    0.003
logs.json  loss          0.01596   0.01951  0.00355
logs.json  step          41        82       41
logs.json  val_accuracy  0.515     0.51     -0.005
logs.json  val_loss      13.70596  5.83056  -7.8754</code></pre></div> <p>The same Git-navigation works with the plots:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">--target</span> logs
</span>file:///Users/dmitry/src/exp-dc/plots.html</code></pre></div> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cdc4ec4dabed1d7de6b8606667ebfc83/39600/dvclive-diff-html.png" alt="dvclive diff html" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Another nice thing about live metrics: they work across ML experiments and checkpoints if properly set up in DVC stages. To set up live metrics, you need to specify the metrics directory in the <code>live</code> section of a stage:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">live</span><span class="token punctuation">:</span>
      <span class="token key atrule">logs</span><span class="token punctuation">:</span>
        <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
        <span class="token key atrule">summary</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
        <span class="token key atrule">report</span><span class="token punctuation">:</span> <span class="token boolean important">true</span>
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> data</code></pre></div> <h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd"
d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>I'd like to thank all of you DVC community members for the feedback that we are constantly getting. This feedback helps us build new functionalities in DVC and make it more stable.</p> <p>Please be in touch with us on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>.</p>https://dvc.org/blog/february-21-dvc-heartbeathttps://dvc.org/blog/february-21-dvc-heartbeatTue, 16 Feb 2021 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Happy February! 
Here's all the news to keep you up to date.</p> <h2 id="weve-hired-and-are-still-hiring" style="position:relative;">We've hired and are still hiring!<a href="#weve-hired-and-are-still-hiring" aria-label="weve hired and are still hiring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have four new team members this month!</p> <p><a href="https://www.linkedin.com/in/david-berenbaum-20b6b424/" target="_blank" rel="nofollow noopener noreferrer"><strong>Dave Berenbaum</strong></a> came to Iterative.ai by way of a <a href="https://github.com/iterative/dvc/pull/2107" target="_blank" rel="nofollow noopener noreferrer">previous contribution</a> to our open source products while working as a Data Science Manager at Capital One. He joins the team as a Technical Product Manager. We are thrilled he's here!</p> <p><a href="https://www.linkedin.com/in/batuhan-osman-taskaya-7803b61a0/" target="_blank" rel="nofollow noopener noreferrer"><strong>Batuhan Taskaya</strong></a> joins us as a DVC Software Engineer working on the Python core. Batuhan is excited to work on open source full time and we are excited to have him do so!</p> <p><a href="https://www.linkedin.com/in/jenifer-de-figueiredo/" target="_blank" rel="nofollow noopener noreferrer"><strong>Jeny De Figueiredo</strong></a> is involved in the Seattle area data science community at Data Circles and is a WiDS Puget Sound Ambassador. She joins us as our new Community Manager and is looking forward to further building and engaging the community in MLOps! (Hi! This is me.
🙋🏻‍♀️ I'll be writing Heartbeat!)</p> <p><a href="https://www.linkedin.com/in/rogermparent/" target="_blank" rel="nofollow noopener noreferrer"><strong>Roger Parent</strong></a> has already been a big part of building DVC and <a href="https://cml.dev/" target="_blank" rel="nofollow noopener noreferrer">CML</a>. He has been a primary developer of a UI that interfaces with the DVC Python application to provide a front end for the Experiments feature that's coming out with DVC 2.0. We are so excited to have him joining us full time as a Software Engineer.</p> <p><img src="https://media.giphy.com/media/vAvWgk3NCFXTa/giphy.gif" alt="Search"></p> <h2 id="open-positions" style="position:relative;">Open Positions<a href="#open-positions" aria-label="open positions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We are on the hunt for a <a href="https://docs.google.com/document/d/1aT5HZYt4kAUxXqD4JNTe3jPDlVUwSmnEWDPR2QoKdvo/edit" target="_blank" rel="nofollow noopener noreferrer">TypeScript Front-End Engineer</a> to build SaaS and a VS Code UI for our popular machine learning tools: DVC and CML. The ML tools ecosystem is what the JS space was 10 years ago.
Come join us on this exciting project!</p> <p>Our search continues for a <a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer">Developer Advocate</a> to support and inspire developers by creating new content like blogs, tutorials, and videos - plus lead outreach through meetups and conferences.</p> <p>Does this sound like you or someone you know? Be in touch!</p> <h2 id="iterativeai-featured-on-the-new-stack" style="position:relative;">Iterative.ai Featured on The New Stack<a href="#iterativeai-featured-on-the-new-stack" aria-label="iterativeai featured on the new stack permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://thenewstack.io/author/susanhall/" target="_blank" rel="nofollow noopener noreferrer">Susan Hall</a> of <a href="https://thenewstack.io/" target="_blank" rel="nofollow noopener noreferrer">The New Stack.io</a> interviewed our very own CEO, <a href="https://twitter.com/fullstackml" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a>, discussing the needs of ML engineers and how Iterative.ai makes tools to enable version control and CI/CD for versioning data and ML models.</p> <blockquote> <p>"ML engineers, they still need collaboration. They need GitHub for collaboration, they need this CI/CD system to resolve [issues] between each other, between the team and productions system." 
- Dmitry Petrov</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://thenewstack.io/iterative-ai-git-based-machine-learning-tools-for-data-engineers/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Learning Tools for ML Engineers</h4> <div class="elp-description">Susan Hall</div> <div class="elp-link">thenewstack.io</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-02-16/newstack_image-b2d3ce71adb8e6bfee248da2677d5804.png" alt="Learning Tools for ML Engineers"> </div> </a> </section> <p></p> <h2 id="workshops-and-talks" style="position:relative;">Workshops and Talks<a href="#workshops-and-talks" aria-label="workshops and talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="developer-advocacy-for-data-science" style="position:relative;">Developer Advocacy for Data Science<a href="#developer-advocacy-for-data-science" aria-label="developer advocacy for data science permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>So you saw the post further up. 
👆🏽 Curious about developer advocacy or what to look for in a hire for this position? <a href="https://twitter.com/drelleobrien" target="_blank" rel="nofollow noopener noreferrer">Elle O'Brien</a> dove into this recently with <a href="https://twitter.com/Al_Grigor" target="_blank" rel="nofollow noopener noreferrer">Alexey Grigorev</a> (author of a <a href="https://mlbookcamp.com/" target="_blank" rel="nofollow noopener noreferrer">Data Science Bookcamp</a>) <a href="https://www.youtube.com/watch?v=jv5W4jXk4P4" target="_blank" rel="nofollow noopener noreferrer">in this podcast</a> on <a href="http://datatalks.club/" target="_blank" rel="nofollow noopener noreferrer">DataTalks.club</a>. You can watch it below. 👇🏼</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/jv5W4jXk4P4?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="from-the-community" style="position:relative;">From the Community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As ever, we have much to
share from the great citizens of the DVC community.</p> <h3 id="spacy-and-dvc-integration" style="position:relative;">spaCy and DVC Integration<a href="#spacy-and-dvc-integration" aria-label="spacy and dvc integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If your NLP team uses spaCy to manage your projects, with spaCy's release of v3.0, you can now enjoy DVC integration to manage your workflow like Git! Check out the <a href="https://spacy.io/usage/projects#integrations" target="_blank" rel="nofollow noopener noreferrer">documentation here</a> to streamline and track your process! 
🏆</p> <p> </p><section class="elp-content-holder"> <a href="https://spacy.io/usage/projects#integrations/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">spaCy Integration</h4> <div class="elp-description">spaCy Integration with DVC</div> <div class="elp-link">spacy.io</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-02-16/spacy_integration-5ed0b2ce56d8ed2cad219e7df076dce1.jpg" alt="spaCy Integration"> </div> </a> </section> <p></p> <h3 id="dagshub-and-dvc-integrations" style="position:relative;">DAGsHub and DVC Integrations<a href="#dagshub-and-dvc-integrations" aria-label="dagshub and dvc integrations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This month, two great articles came out about the integration of DAGsHub and DVC. First, <a href="https://dagshub.com/blog/datasets-should-behave-like-git-repositories/" target="_blank" rel="nofollow noopener noreferrer">Datasets Should Behave Like Git Repositories</a> walks you through the steps to use DVC for your data versioning.
The following image shows the dependencies and how you simply need to do a <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a> each time your dataset or model changes to track the process.</p> <p> </p><section class="elp-content-holder"> <a href="https://dagshub.com/blog/datasets-should-behave-like-git-repositories/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Datasets Should Behave Like Git Repositories</h4> <div class="elp-description">Steps to use DVC in your data versioning</div> <div class="elp-link">dagshub.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-02-16/dagshub-logo-d90d994c91caee62972094d181d37c0f.png" alt="Datasets Should Behave Like Git Repositories"> </div> </a> </section> <p></p> <h3 id="did-you-say-works-out-of-the-box" style="position:relative;">Did you say "Works Out of the Box?"<a href="#did-you-say-works-out-of-the-box" aria-label="did you say works out of the box permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Also from DAGsHub, by CEO <a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a>, <a href="https://dagshub.com/blog/dagshub-storage-zero-configuration-dataset-model-hosting/" target="_blank" rel="nofollow noopener noreferrer">Free Dataset & Model Hosting with Zero Configuration - Launching DAGsHub Storage</a> tells how their new DAGsHub storage is a DVC remote that requires zero configuration (!) 
and will allow for team and organization access controls as well as easy visibility.</p> <p><img src="https://media.giphy.com/media/Ftz07proVX6Rq/giphy.gif" alt="Friends"></p> <h3 id="model-management-and-ml-workflow-orchestration-with-dvc-and-apache-airflow--️" style="position:relative;">Model Management and ML Workflow Orchestration with DVC and Apache Airflow 🇩🇪 ❗️<a href="#model-management-and-ml-workflow-orchestration-with-dvc-and-apache-airflow--%EF%B8%8F" aria-label="model management and ml workflow orchestration with dvc and apache airflow ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We're really excited about a German language workshop led by <a href="https://twitter.com/matthiasniehoff" target="_blank" rel="nofollow noopener noreferrer">Matthias Niehoff</a>! The workshop will be a part of the ML Summit 2021 taking place April 19-21st, but registration closes February 18th. So time is ticking. ⏰ The Conference is online, but will be in German. 
For more info, head here 👉🏽 for the <a href="https://ml-summit.de/machine-learing/modellmanagement-und-ml-workflow-orchestrierung-mit-dvc-und-apache-airflow/" target="_blank" rel="nofollow noopener noreferrer">Workshop Details</a>.</p> <h3 id="the-most-popular-n1-tool-used-by-teams-on-spell" style="position:relative;">"<em>The</em> most popular 'N+1' tool used by teams on Spell"<a href="#the-most-popular-n1-tool-used-by-teams-on-spell" aria-label="the most popular n1 tool used by teams on spell permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://spell.ml/blog/using-dvc-with-spell-YBHOChEAACgAaSmV" target="_blank" rel="nofollow noopener noreferrer">Using DVC as a Lightweight Feature Store on Spell</a> by <a href="https://twitter.com/ResidentMario" target="_blank" rel="nofollow noopener noreferrer">Aleksey Bilogur</a> reviews how to use DVC with Spell to manage changing datasets and enable team-wide data reproducibility, and explains why Spell fans are DVC fans, and vice versa.
🔄</p> <p><img src="https://media.giphy.com/media/GM8PrUsm92hRC/giphy.gif" alt="Fans"></p> <h2 id="tweet-love-️" style="position:relative;">Tweet Love ❤️<a href="#tweet-love-%EF%B8%8F" aria-label="tweet love ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">How do you deploy a machine learning model?<br><br>Check out my new post! <a href="https://t.co/Qx3RtQ7hO0">https://t.co/Qx3RtQ7hO0</a><br><br>In it we build:<br><br>🚀REST service with <a href="https://twitter.com/tiangolo">@tiangolo</a>'s sleek FastAPI<br>🌐Chrome extension to interact with the model<br>🐳Custom <a href="https://twitter.com/Docker">@Docker</a> images<br>🏇CI/CD with <a href="https://twitter.com/DVCorg">@DVCorg</a> + Github actions</p>— Mihail Eric (@mihail_eric) <a href="https://twitter.com/mihail_eric/status/1357014486377324547">February 3, 2021</a></blockquote> <p>You're all caught up! See you at the next Community Gems 💎!</p> <hr> <p><em>Do you have any use case questions or need support? 
Join us in <a href="https://discord.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</em></p> <p><em>Head to the <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC Forum</a> to discuss your ideas and best practices.</em></p>https://dvc.org/blog/january-21-community-gemshttps://dvc.org/blog/january-21-community-gemsTue, 26 Jan 2021 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-is-there-an-equivalent-of-git-restore-file-for-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/799598181310267392" target="_blank" rel="nofollow noopener noreferrer">Q: Is there an equivalent of <code>git restore <file></code> for DVC?</a><a href="#q-is-there-an-equivalent-of-git-restore-file-for-dvc" aria-label="q is there an equivalent of git restore file for dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes! 
You'll want <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>. It restores the corresponding version of your DVC-tracked file or directory from <a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-the-cache-directory" target="_blank" rel="nofollow noopener noreferrer">the cache</a> to your local workspace. <a href="https://dvc.org/doc/command-reference/checkout#checkout" target="_blank" rel="nofollow noopener noreferrer">Read up in our docs for more info!</a></p> <h3 id="q-my-dataset-is-made-of-more-than-a-million-small-files-can-i-use-an-archive-format-like-targz-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/798983422965841920" target="_blank" rel="nofollow noopener noreferrer">Q: My dataset is made of more than <em>a million</em> small files. Can I use an archive format, like <code>tar.gz</code>, with DVC?</a><a href="#q-my-dataset-is-made-of-more-than-a-million-small-files-can-i-use-an-archive-format-like-targz-with-dvc" aria-label="q my dataset is made of more than a million small files can i use an archive format like targz with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are some downsides to using archive formats, and we often discourage it, but let's review some factors to consider so you can make the best choice for your project.</p> <ul> <li>If your <code>tar.gz</code> file changes at all (perhaps because you changed a single file before zipping), you'll end up with an entirely new copy of the archive every time
you commit! This is not very space efficient, but if space isn't an issue it might not be a dealbreaker.</li> <li>Because DVC optimizes data transfer at the file level, you'll end up transferring the whole archive any time you modify a single file and then <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>/<a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>.</li> <li>In general, archives don't play nice with the concept of diffs. Looking back at your Git history, it can be challenging to track how files were deleted, modified, or added when you're versioning archives.</li> </ul> <p>While we can't do much about the general issues that archives present for version control systems, DVC does have some options that might help you achieve better data transfer speeds. We recommend exploring DVC's built-in parallelism: data transfer commands like <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> and <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> have a flag (<code>-j</code>) for increasing the number of jobs run simultaneously. <a href="https://dvc.org/doc/command-reference/push#options" target="_blank" rel="nofollow noopener noreferrer">Check out the docs for more details</a>.</p> <p>In summary, the advantage of using an archive format will depend on both how often you modify your dataset and how often you need to push and pull data. You might consider exploring both approaches (with and without compression) and running some speed tests for your use case. We'd love to know what you find!</p> <h3 id="q-my-dvc-remote-is-a-server-with-a-self-signed-certificate-when-i-push-data-dvc-is-giving-me-an-ssl-verification-error--how-can-i-get-around-this" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/800707271502856222" target="_blank" rel="nofollow noopener noreferrer">Q: My DVC remote is a server with a self-signed certificate.
When I push data, DVC is giving me an SSL verification error- how can I get around this?</a><a href="#q-my-dvc-remote-is-a-server-with-a-self-signed-certificate-when-i-push-data-dvc-is-giving-me-an-ssl-verification-error--how-can-i-get-around-this" aria-label="q my dvc remote is a server with a self signed certificate when i push data dvc is giving me an ssl verification error how can i get around this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>On S3 or S3-compatible storage, you can configure your AWS CLI to use a custom certificate path. <a href="https://docs.aws.amazon.com/credref/latest/refdocs/setting-global-ca_bundle.html" target="_blank" rel="nofollow noopener noreferrer">As suggested by their docs</a>, you can also set the environment variable <code>AWS_CA_BUNDLE</code> to your <code>.pem</code> file.</p> <p>Similarly, on HTTP and Webdav remotes, there's <code>REQUESTS_CA_BUNDLE</code> environment variable that you can set your self-signed certificate file to.</p> <p>Then, when DVC tries to access your storage, you should be able to get past SSL verification!</p> <h3 id="q-i-want-to-be-able-to-make-my-own-plots-in-python-with-data-points-from-my-dvc-plots-including-older-versions-of-those-plots-what-do-you-recommend-to-get-the-raw-historical-data" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/799617584336338954" target="_blank" rel="nofollow noopener noreferrer">Q: I want to be able to make my own plots in Python with data points from my <code>dvc plots</code>, including older 
versions of those plots. What do you recommend to get the raw historical data?</a><a href="#q-i-want-to-be-able-to-make-my-own-plots-in-python-with-data-points-from-my-dvc-plots-including-older-versions-of-those-plots-what-do-you-recommend-to-get-the-raw-historical-data" aria-label="q i want to be able to make my own plots in python with data points from my dvc plots including older versions of those plots what do you recommend to get the raw historical data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We suggest</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>repo <span class="token keyword">import</span> Repo
<span class="token comment"># `revs` must already hold a list of Git revisions (commits, tags, or branches)</span>
revs <span class="token operator">=</span> Repo<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>plots<span class="token punctuation">.</span>collect<span class="token punctuation">(</span>revs<span class="token operator">=</span>revs<span class="token punctuation">)</span></code></pre></div> <p>Then you can plot the data contained in <code>revs</code> to your heart's content!</p> <h3 id="q-is-it-safe-to-share-a-dvc-remote-between-two-projects-or-registries" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/799216349405904896" target="_blank" rel="nofollow noopener noreferrer">Q: Is it safe to share a DVC remote between two projects or registries?</a><a 
href="#q-is-it-safe-to-share-a-dvc-remote-between-two-projects-or-registries" aria-label="q is it safe to share a dvc remote between two projects or registries permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can share a remote with as many projects as you like. Because DVC uses content-addressable storage, you'll still get benefits like file deduplication across every project that uses the remote. This can be useful if you're likely to have many shared files across projects.</p> <p>One big thing to watch out for: you have to be very careful with clearing the DVC cache. When running <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>, use the <code>--projects</code> flag so that you don't remove files associated with another project. 
<a href="https://dvc.org/doc/command-reference/gc#options" target="_blank" rel="nofollow noopener noreferrer">Read up in the docs!</a></p> <h3 id="q-can-i-throttle-the-number-of-simultaneous-uploads-to-remote-storage-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/802099863076208662" target="_blank" rel="nofollow noopener noreferrer">Q: Can I throttle the number of simultaneous uploads to remote storage with DVC?</a><a href="#q-can-i-throttle-the-number-of-simultaneous-uploads-to-remote-storage-with-dvc" aria-label="q can i throttle the number of simultaneous uploads to remote storage with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yep! 
That'll be the <code>-j/--jobs</code> flag, for example:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token parameter variable">-j</span> <span class="token operator"><</span>number<span class="token operator">></span></span></code></pre></div> <p>will control the number of simultaneous uploads DVC attempts when pushing files to your remote storage (<a href="https://dvc.org/doc/command-reference/push#push" target="_blank" rel="nofollow noopener noreferrer">see more in our docs</a>).</p> <h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-i-have-a-dvc-pipeline-that-i-want-to-run-in-cicd-specifically-i-only-want-to-reproduce-the-stages-that-have-changed-since-my-last-commit-what-do-i-do" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/796185815574511616" target="_blank" rel="nofollow noopener noreferrer">Q: I have a DVC pipeline that I want to run in CI/CD. Specifically, I only want to reproduce the stages that have changed since my last commit. 
What do I do?</a><a href="#q-i-have-a-dvc-pipeline-that-i-want-to-run-in-cicd-specifically-i-only-want-to-reproduce-the-stages-that-have-changed-since-my-last-commit-what-do-i-do" aria-label="q i have a dvc pipeline that i want to run in cicd specifically i only want to reproduce the stages that have changed since my last commit what do i do permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC pipelines, like makefiles, will only reproduce stages that DVC detects have changed since the last commit. So to do this in CI/CD systems like GitHub Actions or GitLab CI, you'll want to make sure the workflow a) syncs the runner with the latest version of your pipeline, including all inputs and dependencies, and b) reruns your DVC pipeline.</p> <p>In practice, your workflow needs to include these two commands:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div> <p>You pull the latest version of your pipeline, inputs and dependencies from cloud storage with <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>, and then <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> intelligently reproduces the pipeline (meaning, it should avoid rerunning stages that haven't changed since the last commit).</p> <p>Check out an <a 
href="https://github.com/iterative/cml_dvc_case/blob/master/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">example workflow here</a>.</p> <h3 id="q-im-using-dvc-and-cml-to-pull-data-from-cloud-storage-then-train-a-model-i-want-to-push-the-trained-model-into-cloud-storage-when-im-done-what-should-i-do" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/801553810618187796" target="_blank" rel="nofollow noopener noreferrer">Q: I'm using DVC and CML to pull data from cloud storage, then train a model. I want to push the trained model into cloud storage when I'm done, what should I do?</a><a href="#q-im-using-dvc-and-cml-to-pull-data-from-cloud-storage-then-train-a-model-i-want-to-push-the-trained-model-into-cloud-storage-when-im-done-what-should-i-do" aria-label="q im using dvc and cml to pull data from cloud storage then train a model i want to push the trained model into cloud storage when im done what should i do permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>One approach is to run</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> <span class="token operator"><</span>model<span class="token operator">></span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token operator"><</span>model<span class="token 
operator">></span></span></code></pre></div> <p>at the end of your workflow. This will push the model file, but there's a downside: it won't keep a strong link between the pipeline (meaning, the command you used to generate the model and any code/data dependencies) and the model file.</p> <p>What we recommend is that you create a <a href="https://dvc.org/doc/start/data-pipelines#get-started-data-pipelines" target="_blank" rel="nofollow noopener noreferrer">DVC pipeline</a> with one stage- training your model- and declare your model file as an output. Then, your workflow can look like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># get data</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> <span class="token parameter variable">--run-cache</span> </span> <span class="token comment"># run the pipeline</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> </span> <span class="token comment"># push to remote storage</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token parameter variable">--run-cache</span></span></code></pre></div> <p>When you run this workflow with the <code>--run-cache</code> flags, you'll be able to save all the results of the pipeline in the cloud (<a href="https://dvc.org/doc/command-reference/push#options" target="_blank" rel="nofollow noopener noreferrer">read more here</a>). 
When the run has completed, you can go to your local workspace and run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> <span class="token parameter variable">--run-cache</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div> <p>This will put your model in your local workspace! And, you get an immutable link between the code version, data version and model you end up with.</p> <p>We recommend this approach so you don't lose track of how model files relate to the data and code that produced them. It's a little more work to set up, but Future You will thank you!</p> <p><img src="https://media.giphy.com/media/l0LEIXSRRuv9QQIRNI/giphy.gif" alt="Tim Robinson Reaction GIF by The Lonely Island"></p>https://dvc.org/blog/january-21-dvc-heartbeathttps://dvc.org/blog/january-21-dvc-heartbeatWed, 20 Jan 2021 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Welcome to the first Heartbeat of 2021! 
Here's some new year news.</p> <h3 id="were-still-hiring" style="position:relative;">We're still hiring<a href="#were-still-hiring" aria-label="were still hiring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our search continues for a <a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong></a> to support and inspire developers by creating new content like blogs, tutorials, and videos- plus lead outreach through meetups and conferences.</p> <p>Does this sound like you or someone you know? Be in touch!</p> <h3 id="7000-stars-on-github" style="position:relative;">7000 stars on GitHub<a href="#7000-stars-on-github" aria-label="7000 stars on github permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We recently passed 7000 stars on the <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC GitHub repository</a>! We crossed the 7k mark extremely close to midnight on New Year's Eve, so we probably hit it in time for the new year in at least one time zone. 
Anyway, it made for a very suspenseful countdown to midnight. Woot woot!</p> <p><img src="https://media.giphy.com/media/QAPFLCrpfalPi/giphy.gif" alt="Make Countdown GIF"></p> <p>The repo is HQ for DVC development, meaning- if you have an issue to report, a feature to request, or a pull request to offer, this is where you should start!</p> <h3 id="new-video-for-r-users" style="position:relative;">New video for R users<a href="#new-video-for-r-users" aria-label="new video for r users permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>A lot of our videos about GitHub Actions have used Python scripts, but there's no reason to restrict <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">Continuous Machine Learning</a> to one language. We've just released our first-ever R language video, which covers</p> <ul> <li>How to install R on a GitHub Actions runner</li> <li>How to manage R package dependencies for continuous integration (teaser: CRAN binaries are amazing)</li> <li>Putting a <code>ggplot</code> or a <code>kable</code> table in your pull request</li> </ul> <p>Watch and follow along! 
If you make something based on this approach, or if you think there's a better way, please tell us- we're eager to see what the R community thinks.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/NwUijrm2U2w?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="workshops-and-talks" style="position:relative;">Workshops and talks<a href="#workshops-and-talks" aria-label="workshops and talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>On Friday, January 24, I (Elle) spoke with <a href="https://twitter.com/Al_Grigor" target="_blank" rel="nofollow noopener noreferrer">Alexey Grigorev</a> (author of a <a href="https://mlbookcamp.com/" target="_blank" rel="nofollow noopener noreferrer">Data Science Bookcamp</a>), on his podcast about being a developer advocate in the machine learning space! If you're curious about what the role entails, or what to look for when hiring a developer advocate for your machine learning project, please come by. 
The event is up on YouTube, and will soon be available as a podcast for your listening pleasure 🎧</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/jv5W4jXk4P4?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As ever, we have much to share from the great citizens of the DVC community.</p> <h3 id="wheres-baby-yoda" style="position:relative;">Where's Baby Yoda?<a href="#wheres-baby-yoda" aria-label="wheres baby yoda permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z"></path></svg> </a></h3> <p>There's a brand new blog post we love, and only half of that has to do with its impressive collection of Baby Yoda pics. <a href="https://dagshub.com/blog/author/simon/" target="_blank" rel="nofollow noopener noreferrer">Simon Lousky</a>, developer at <a href="https://dagshub.com" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a>, published a blog provocatively titled <a href="https://dagshub.com/blog/datasets-should-behave-like-git-repositories/" target="_blank" rel="nofollow noopener noreferrer"><em>Datasets should behave like git repositories</em></a>. He writes:</p> <blockquote> <p>While data versioning solves the problem of managing data in the context of your machine learning project, it brings with it a new approach to managing datasets. This approach, also described as data registries here, consists of creating a git repository entirely dedicated to managing a dataset. This means that instead of training models on frozen datasets - something researchers, students, kagglers, and open source machine learning contributors often do - you could link your project to a dataset (or to any file for that matter), and treat it as a dependency. After all, data can and should be treated as code, and follow through a review process.</p> </blockquote> <p>We agree! Lousky goes on to show us a brilliant code example wherein he segments instances of Baby Yoda out of frames from The Mandalorian. 
DVC plays a key role in keeping track of all the Baby Yodas, which is pretty much the most important use case we could've imagined.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 480px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/291a8f82c6d13846fb7a83a13386b1b6/39600/bb_yoda.png" alt="bb yoda" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Found them!</em></p> <p>There's also a <a href="https://www.reddit.com/r/MachineLearning/comments/l0l0oc/p_datasets_should_behave_like_git_repositories/" target="_blank" rel="nofollow noopener noreferrer">lively discussion about the post on Reddit</a>. Check it out and consider contributing your own Baby Yoda image annotations to grow the dataset!</p> <h3 id="data-version-control-explained" style="position:relative;">Data Version Control Explained<a href="#data-version-control-explained" aria-label="data version control explained permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Researcher <a href="https://blog.crowdbotics.com/author/nimra/" target="_blank" rel="nofollow noopener noreferrer">Nimra Ejaz</a> published a fantastically detailed introduction to DVC. She even included a "History of DVC" section, which is pretty cool for us- this might be a first!</p> <p>Her blog covers not only the key features of DVC, but a thoughtful pros-and-cons list <em>and</em> a case study about using DVC in an image classification project. 
If you want an up-to-date, high-level overview of DVC and some help deciding if it fits your needs, I couldn't recommend Nimra's blog more.</p> <p> </p><section class="elp-content-holder"> <a href="https://blog.crowdbotics.com/data-version-control-explained/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data Version Control Explained</h4> <div class="elp-description">Nimra Ejaz</div> <div class="elp-link">crowdbotics.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-01-20/crowdbotics-cd9021f03aa5ede1fbe280a356617516.png" alt="Data Version Control Explained"> </div> </a> </section> <p></p> <h3 id="one-more-thing-from-dagshub" style="position:relative;">One more thing from DAGsHub<a href="#one-more-thing-from-dagshub" aria-label="one more thing from dagshub permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a>, CEO of DAGsHub, shared an important update: they now offer FREE dataset and model hosting for DVC projects (up to 10 GB per user and project, with flexibility for public projects)! And with no configuration!</p> <p>That means you don't have to configure your DVC remote to use DVC with model and data storage in the cloud- DAGsHub will handle <em>all</em> of it. Your DVC remote can be added as easily as a Git remote, in other words. 
Read the announcement, and then dig into their <a href="https://dagshub.com/docs/experiment-tutorial/overview/" target="_blank" rel="nofollow noopener noreferrer">basic tutorial</a> to get started.</p> <p> </p><section class="elp-content-holder"> <a href="https://dagshub.com/blog/dagshub-storage-zero-configuration-dataset-model-hosting/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Free Dataset & Model Hosting with Zero Configuration – Launching DAGsHub Storage</h4> <div class="elp-description">Dean Pleban</div> <div class="elp-link">dagshub.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2021-01-20/dagshub-aa036fbcd9874d7c399ca6ef36cfc846.jpg" alt="Free Dataset & Model Hosting with Zero Configuration – Launching DAGsHub Storage"> </div> </a> </section> <p></p> <h3 id="a-nice-tweet" style="position:relative;">A nice tweet<a href="#a-nice-tweet" aria-label="a nice tweet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/bibryam" target="_blank" rel="nofollow noopener noreferrer">Bilgin Ibryam</a>, author of the <a href="https://www.redhat.com/en/engage/kubernetes-containers-architecture-s-201910240918" target="_blank" rel="nofollow noopener noreferrer">Kubernetes Patterns</a> book, gave us a shoutout for being an interesting data engineering project (according to a list by another expert we trust, <a href="https://twitter.com/squarecog" target="_blank" rel="nofollow noopener noreferrer">Dmitry Ryabov</a>). 
Thanks Bilgin and Dmitry, we think you're very interesting too!</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Five Interesting Data Engineering Projects (<a href="https://twitter.com/getdbt">@getdbt</a>, <a href="https://twitter.com/PrefectIO">@PrefectIO</a>, <a href="https://twitter.com/dask_dev">@dask_dev</a>, <a href="https://twitter.com/DVCorg">@DVCorg</a>, greatexpectations)<a href="https://t.co/XXeLXYDp0M">https://t.co/XXeLXYDp0M</a> by <a href="https://twitter.com/squarecog">@squarecog</a></p>— Bilgin Ibryam (@bibryam) <a href="https://twitter.com/bibryam/status/1341777034448650242">December 23, 2020</a></blockquote>https://dvc.org/blog/december-20-community-gemshttps://dvc.org/blog/december-20-community-gemsWed, 30 Dec 2020 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-is-there-a-way-to-plot-all-columns-in-a-csv-file-on-a-single-graph-using-dvc-plot" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/768689062314770442" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to plot all columns in a <code>.csv</code> file on a single graph using <code>dvc plot</code>?</a><a href="#q-is-there-a-way-to-plot-all-columns-in-a-csv-file-on-a-single-graph-using-dvc-plot" aria-label="q is there a way to plot all columns in a csv file on a single graph using dvc plot permalink" class="anchor after"><svg aria-hidden="true" 
height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>By default, <code>dvc plots</code> graphs one or two columns from the metric file of your choice (use the <code>-x</code> and <code>-y</code> flags to specify which columns).</p> <p>However, there's nothing special about the way DVC makes plots. The plot function is a wrapper for the <a href="https://vega.github.io/vega-lite-v1/" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite</a> grammar, which can make pretty much any kind of plot you can imagine. If you check inside <code>.dvc/plots/</code>, you'll see a few Vega-Lite template files- that's where the plotting instructions are stored!</p> <p>You can create your own, or modify the existing templates, by <a href="https://dvc.org/doc/command-reference/plots#plot-templates" target="_blank" rel="nofollow noopener noreferrer">following the instructions in our docs</a>. In short, you'll create a new template and then run <code>dvc plots show -t <name-of-template></code> to use it!</p> <p>Vega-Lite has an <a href="https://vega.github.io/editor/#/" target="_blank" rel="nofollow noopener noreferrer">interactive template editor online</a>, which might help you test out ideas. 
Happy creating, and if you come up with a template you'd like to share with the DVC community, <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">consider opening a pull request!</a></p> <h3 id="q-my-teammate-and-i-are-having-some-issues-keeping-our-workplaces-synced-were-tracking-some-folders-with-dvc-and-he-recently-added-a-new-file-to-each-of-these-folders-how-does-he-update-the-tracked-folder-and-push-the-new-contents-so-i-can-access-them-too" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/785965719367843860" target="_blank" rel="nofollow noopener noreferrer">Q: My teammate and I are having some issues keeping our workplaces synced. We're tracking some folders with DVC, and he recently added a new file to each of these folders. How does he update the tracked folder and push the new contents so I can access them, too?</a><a href="#q-my-teammate-and-i-are-having-some-issues-keeping-our-workplaces-synced-were-tracking-some-folders-with-dvc-and-he-recently-added-a-new-file-to-each-of-these-folders-how-does-he-update-the-tracked-folder-and-push-the-new-contents-so-i-can-access-them-too" aria-label="q my teammate and i are having some issues keeping our workplaces synced were tracking some folders with dvc and he recently added a new file to each of these folders how does he update the tracked folder and push the new contents so i can access them too permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Your partner should first run</p> <div 
class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> <span class="token operator">&lt;</span>folder<span class="token operator">&gt;</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div> <p>to tell DVC about the new file and then push its contents to remote storage. Next, they'll run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token operator">&lt;</span>folder<span class="token operator">&gt;</span>.dvc
</span><span class="token line"><span class="token input">$ </span><span class="token git">git push</span></span></code></pre></div> <p>to update your shared Git repository. Then you can do a <code>git pull</code> and <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> to sync the changes with your local workspace!</p> <h3 id="q-i-forgot-to-declare-a-metric-output-in-my-dvcyaml-file-so-one-of-my-metrics-is-currently-untracked-how-can-i-fix-this-without-rerunning-the-stage-it-takes-a-long-time-to-run" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/781643749050155009" target="_blank" rel="nofollow noopener noreferrer">Q: I forgot to declare a metric output in my <code>dvc.yaml</code> file, so one of my metrics is currently untracked. How can I fix this without rerunning the stage?
It takes a long time to run.</a><a href="#q-i-forgot-to-declare-a-metric-output-in-my-dvcyaml-file-so-one-of-my-metrics-is-currently-untracked-how-can-i-fix-this-without-rerunning-the-stage-it-takes-a-long-time-to-run" aria-label="q i forgot to declare a metric output in my dvcyaml file so one of my metrics is currently untracked how can i fix this without rerunning the stage it takes a long time to run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>No problem. What you'll want to do is edit your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file to declare the metric, and then run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit dvc.yaml</code></a> to store the change.</p> <p><a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> is a helpful command that updates your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> file and <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files as needed, which forces DVC to accept any modifications to tracked data currently in your workspace.
That should cover the case where you have a metric file from your last pipeline run in your workspace, but forgot to add it to the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> as an output!</p> <p><a href="https://dvc.org/doc/command-reference/commit#commit" target="_blank" rel="nofollow noopener noreferrer">Check out the docs</a> for more about <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> and how it can help you edit pipeline dependencies as you work.</p> <h3 id="q-can-i-have-multiple-dvcyaml-files" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/784083794583486496" target="_blank" rel="nofollow noopener noreferrer">Q: Can I have multiple <code>dvc.yaml</code> files?</a><a href="#q-can-i-have-multiple-dvcyaml-files" aria-label="q can i have multiple dvcyaml files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes. The catch is that they have to be in separate directories. For example, you can define independent pipelines, each in its own <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. It's also possible to spread a single pipeline across more than one <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file.
DVC analyzes all of them to rebuild the DAG(s), for example during <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>.</p> <h3 id="q-i-want-to-work-on-my-dvc-pipeline-on-a-different-computer-than-usual-for-the-stage-im-developing-i-dont-need-access-to-all-the-data-dependencies-of-the-earlier-stages--is-there-a-way-to-download-only-what-i-need" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/788068487246512158" target="_blank" rel="nofollow noopener noreferrer">Q: I want to work on my DVC pipeline on a different computer than usual. For the stage I'm developing, I don't need access to all the data dependencies of the earlier stages- is there a way to download only what I need?</a><a href="#q-i-want-to-work-on-my-dvc-pipeline-on-a-different-computer-than-usual-for-the-stage-im-developing-i-dont-need-access-to-all-the-data-dependencies-of-the-earlier-stages--is-there-a-way-to-download-only-what-i-need" aria-label="q i want to work on my dvc pipeline on a different computer than usual for the stage im developing i dont need access to all the data dependencies of the earlier stages is there a way to download only what i need permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Say for example that you have a pipeline like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">+----------+
| data.dvc |
+----------+
     *
     *
     *
   +----+
   | s1 |
   +----+
     *
     *
     *
   +----+
   | s2 |
   +----+
     *
     *
     *
   +----+
   | s3 |
   +----+</code></pre></div>
<p>where stage <code>s2</code> is frozen (meaning, its dependencies will not change and we can be reasonably sure the outputs of <code>s2</code> are static).</p> <p>To work on stage <code>s3</code> in a new workspace, you could run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> s2 </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> s3</span></code></pre></div> <p>This set of commands will pull only the targeted stage (not the data corresponding to <code>data.dvc</code>), and then execute the final stage of your pipeline only.</p> <h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-why-do-you-need-docker-to-run-cml" style="position:relative;"><a href="https://www.youtube.com/watch?v=rVq-SCNyxVc&lc=UgzohiMVxO1GKB30bad4AaABAg" target="_blank" rel="nofollow noopener noreferrer">Q: Why do you need Docker to run CML?</a><a href="#q-why-do-you-need-docker-to-run-cml" aria-label="q why do you need docker to run cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 
0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Even though we use Docker in many of our tutorials, you technically <em>don't</em> need it at all! Here's what's going on:</p> <p>We use a custom Docker container that comes with the CML functions installed (as well as some useful data science tools like Python, Vega-Lite, and CUDA drivers). If you want to use your own Docker container, that's fine too- just make sure you install the CML library of functions on your runner.</p> <p>To install CML as an <code>npm</code> package on your runner, we recommend:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">npm i -g @dvcorg/cml</code></pre></div> <p>Once this is done, you should be able to execute functions like <code>cml publish</code> and <code>cml send-comment</code> on your runner.</p> <p>For more tips about using CML without Docker, <a href="https://github.com/iterative/cml#install-cml-as-a-package" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>.</p> <h3 id="q-im-using-cml-to-print-a-dvc-metrics-diff-to-my-pull-request-in-github-but-im-getting-an-error-token-not-found-what-does-that-mean" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/786382971706933258" target="_blank" rel="nofollow noopener noreferrer">Q: I'm using CML to print a <code>dvc metrics diff</code> to my pull request in GitHub, but I'm getting an error: <code>token not found</code>. 
What does that mean?</a><a href="#q-im-using-cml-to-print-a-dvc-metrics-diff-to-my-pull-request-in-github-but-im-getting-an-error-token-not-found-what-does-that-mean" aria-label="q im using cml to print a dvc metrics diff to my pull request in github but im getting an error token not found what does that mean permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Generally, <code>token</code> refers to an authorization token that grants your runner certain permissions with the GitHub API, such as the ability to post a comment on your pull request. If you're working in GitHub, you don't have to follow any manual steps to create a token. But you <em>do</em> need to make sure the environment variables in your workflow are named properly.</p> <p>Make sure you've specified the following field in your workflow file:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">env</span><span class="token punctuation">:</span>
  <span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GITHUB_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span></code></pre></div> <p>The variable must be called <code>repo_token</code> for CML to recognize it!</p> <p>A few other pointers:</p> <ul> <li>In GitLab, you have to set a variable in your repository called <code>repo_token</code> whose value is a Personal Access Token.
We have <a href="https://github.com/iterative/cml/wiki/CML-with-GitLab#variables" target="_blank" rel="nofollow noopener noreferrer">step-by-step instructions in our docs</a>. Forgetting to set this is the #1 issue we see with first-time GitLab CI users!</li> <li>In BitBucket Cloud, you need to set a variable in your repository called <code>repo_token</code> whose value is your API credentials. We have <a href="https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud#repository-variables" target="_blank" rel="nofollow noopener noreferrer">detailed docs for creating this token</a>, too.</li> <li>Need to see more sample workflows to get a feel for it? We have plenty <a href="https://dvc.org/doc/cml#case-studies" target="_blank" rel="nofollow noopener noreferrer">of case studies</a> to examine.</li> </ul> <h3 id="q-is-there-any-reason-why-an-experimental-dvc-feature-wouldnt-work-on-the-cml-docker-container" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/788512890394247178" target="_blank" rel="nofollow noopener noreferrer">Q: Is there any reason why an experimental DVC feature wouldn't work on the CML Docker container?</a><a href="#q-is-there-any-reason-why-an-experimental-dvc-feature-wouldnt-work-on-the-cml-docker-container" aria-label="q is there any reason why an experimental dvc feature wouldnt work on the cml docker container permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Generally, no- the container <code>dvcorg/cml:latest</code> should have the latest DVC release and the latest CML 
release (you can see where DVC and CML are installed from in our <a href="https://github.com/iterative/cml/blob/master/Dockerfile" target="_blank" rel="nofollow noopener noreferrer">Dockerfile</a>). So besides the time it takes for releases to be published on various package managers, there shouldn't be any lag. That means experimental features are ready to play on your runner!</p> <p>Note that you can also install pre-release versions of DVC- check out our <a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">docs about installing the latest stable version ahead of official releases</a>.</p>https://dvc.org/blog/december-20-dvc-heartbeathttps://dvc.org/blog/december-20-dvc-heartbeatFri, 18 Dec 2020 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Welcome to the December Heartbeat! 
Let's dive in with some news from the team.</p> <h3 id="were-still-hiring" style="position:relative;">We're still hiring<a href="#were-still-hiring" aria-label="were still hiring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our search continues for two roles:</p> <ul> <li> <p>A <a href="https://weworkremotely.com/remote-jobs/iterative-senior-software-engineer-open-source-dev-tools-3" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Software Engineer</strong></a> for the core DVC team- someone with strong Python development skills who can build and ship essential DVC features.</p> </li> <li> <p>A <a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong></a> to support and inspire developers by creating new content like blogs, tutorials, and videos- plus lead outreach through meetups and conferences.</p> </li> </ul> <p>Does this sound like you or someone you know? 
Be in touch!</p> <h3 id="video-docs-complete" style="position:relative;">Video docs complete!<a href="#video-docs-complete" aria-label="video docs complete permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>As you may have heard <a href="https://dvc.org/blog/november-20-dvc-heartbeat" target="_blank" rel="nofollow noopener noreferrer">last month</a>, we've been working on adding complete video docs to the "Getting Started" section of the DVC site. We now have 100% coverage! We have videos that mirror the tutorials for:</p> <ul> <li> <p><a href="https://dvc.org/doc/start/data-and-model-versioning" target="_blank" rel="nofollow noopener noreferrer">Data versioning</a> - how to use Git and DVC together to track different versions of a dataset</p> </li> <li> <p><a href="https://dvc.org/doc/start/data-and-model-access" target="_blank" rel="nofollow noopener noreferrer">Data access</a> - how to share models and datasets across projects and environments</p> </li> <li> <p><a href="https://dvc.org/doc/start/data-pipelines" target="_blank" rel="nofollow noopener noreferrer">Pipelines</a> - how to create reproducible pipelines to transform datasets to features to models</p> </li> <li> <p><a href="https://dvc.org/doc/start/experiments" target="_blank" rel="nofollow noopener noreferrer">Experiments</a> - how to do a <code>git diff</code> for models that compares and visualizes metrics</p> </li> </ul> <p><img src="https://media.giphy.com/media/L4ZZNbDpOCfiX8uYSd/giphy.gif" alt="Mission Accomplished GIF by memecandy"></p> <p>The <a 
href="https://www.youtube.com/playlist?list=PL7WG7YrwYcnDb0qdPl9-KEStsL-3oaEjg" target="_blank" rel="nofollow noopener noreferrer">full playlist is on our YouTube channel</a>, where, by the way, we've recently passed 2,000 subscribers! Thanks so much for your support. There's much more coming up soon.</p> <h3 id="collaboration-with-gitlab" style="position:relative;">Collaboration with GitLab<a href="#collaboration-with-gitlab" aria-label="collaboration with gitlab permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We recently released a new blog post with GitLab all about using <a href="https://cml.dev">CML</a> with GitLab CI.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">The team behind <a href="https://t.co/At942BC7sF">https://t.co/At942BC7sF</a> released an open source project called CML (continuous machine learning). <br><br>Learn more about GitLab ➕ <a href="https://twitter.com/DVCorg">@DVCorg</a>! <a href="https://t.co/eD8loo4mT5">https://t.co/eD8loo4mT5</a></p>&mdash; 🦊 GitLab (@gitlab) <a href="https://twitter.com/gitlab/status/1334631001956487171">December 3, 2020</a></blockquote> <p>You may notice that the tweet spelled our name differently, and since Twitter doesn't have an edit button, I think that means we're "Interative" now.
<a href="https://www.zazzle.com/t_shirt-235920696568133954" target="_blank" rel="nofollow noopener noreferrer">Hurry up and get your merch!</a></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 536px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/3e2ee29409886ff96de8060077295dcd/39600/newname.png" alt="newname" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="workshops" style="position:relative;">Workshops<a href="#workshops" aria-label="workshops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We gave a workshop at a virtual meetup held by the <a href="https://mlopsworld.com/about-us/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Society</a>, and you can catch a video recording if you missed it. This workshop was all about getting started with GitHub Actions and CML! 
It starts with some high-level overview and then gets into live-coding.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/51H13lfHdMw?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There's no shortage of cool things to report from the community:</p> <h3 id="the-dvc-udemy-course" style="position:relative;">The DVC Udemy Course<a href="#the-dvc-udemy-course" aria-label="the dvc udemy course permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> 
</a></h3> <p>Now you can learn the fundamentals of machine learning engineering, from experiment tracking to data management to continuous integration, with DVC and Udemy! Data scientists/DVC ambassadors <a href="https://www.udemy.com/user/mnrozhkov/" target="_blank" rel="nofollow noopener noreferrer">Mikhail Rozhkov</a> and <a href="https://www.udemy.com/user/marcel-da-camara-ribeiro-dantas/" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a> created a course full of <a href="https://www.udemy.com/course/machine-learning-experiments-and-engineering-with-dvc/?referralCode=68BEB2A7E246A54E5E35" target="_blank" rel="nofollow noopener noreferrer">practical tips and tricks for learners of all levels</a>.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.udemy.com/course/machine-learning-experiments-and-engineering-with-dvc/?referralCode=68BEB2A7E246A54E5E35" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Machine Learning Experiments and Engineering with DVC</h4> <div class="elp-description">Automate machine learning experiments, pipelines and model deployment (CI/CD, MLOps) with Data Version Control (DVC).</div> <div class="elp-link">udemy.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-12-18/udemy-90fceb9dfeae3078b718199d02bfd2d3.png" alt="Machine Learning Experiments and Engineering with DVC"> </div> </a> </section> <p></p> <h3 id="a-proposal-for-git-flow-with-dvc" style="position:relative;">A proposal for Git-flow with DVC<a href="#a-proposal-for-git-flow-with-dvc" aria-label="a proposal for git flow with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 
2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://www.uni-augsburg.de/en/fakultaet/fai/informatik/prof/swtpvs/team/fabian-rabe/" target="_blank" rel="nofollow noopener noreferrer">Fabian Rabe</a> at <a href="https://www.uni-augsburg.de/en/" target="_blank" rel="nofollow noopener noreferrer">Universität Augsburg</a> wrote a killer doc about his team's tried-and-true approach to creating a workflow for a DVC project. He writes,</p> <blockquote> <p>Over the past couple of months we have started using DVC in our small team. With a handful of developers all coding, training models & committing in the same repository, we soon realized the need for a workflow.</p> </blockquote> <p>The post outlines three strategies his team adopted:</p> <ol> <li> <p>Create a "debugging dataset" containing a subset of your data, with which you can test your complete DVC pipeline locally on a developer's machine</p> </li> <li> <p>Use CI-Runners to execute the DVC pipeline on the full dataset</p> </li> <li> <p>Adopt a naming convention for Git branches that correspond to machine learning experiments, in addition to the usual feature branches</p> </li> </ol> <p>Agree? Disagree? 
Fabian is actively soliciting feedback on his proposal (and possible solutions for some unresolved issues), so please read and <a href="https://discuss.dvc.org/t/git-flow-for-dvc/578/6" target="_blank" rel="nofollow noopener noreferrer">chime in on our discussion board</a>.</p> <p> </p><section class="elp-content-holder"> <a href="https://git.rz.uni-augsburg.de/rabefabi/git-flow-for-dvc" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Git Flow for DVC</h4> <div class="elp-description">Fabian Rabe</div> <div class="elp-link">git.rz.uni-augsburg.de</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-12-18/universitat_augs-72cc857d548d5f6bae11cf544b62c097.jpg" alt="Git Flow for DVC"> </div> </a> </section> <p></p> <h3 id="channel-9-talks-machine-learning-and-python" style="position:relative;">Channel 9 talks Machine Learning and Python<a href="#channel-9-talks-machine-learning-and-python" aria-label="channel 9 talks machine learning and python permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://channel9.msdn.com/Shows/AI-Show" target="_blank" rel="nofollow noopener noreferrer">The AI Show on Channel 9</a>, part of the Microsoft DevRel universe, put out an episode all about ML and scientific computing with Python featuring <a href="https://twitter.com/ixek" target="_blank" rel="nofollow noopener noreferrer">Tania Allard</a> and <a href="https://twitter.com/sethjuarez" target="_blank" rel="nofollow noopener noreferrer">Seth Juarez</a>. 
Their episode includes how DVC can fit in this development toolkit, so check it out!</p> <div class="gatsby-resp-iframe-wrapper" style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden; "> <iframe src="https://channel9.msdn.com/Shows/AI-Show/Machine-Learning-and-Scientific-Computing-with-Python/player" allowfullscreen frameborder="0" title="Machine Learning and Scientific Computing with Python - Microsoft Channel 9 Video" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> <h3 id="a-nice-tweet" style="position:relative;">A nice tweet<a href="#a-nice-tweet" aria-label="a nice tweet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We'll end on a tweet we love:</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I learned quite a bit in <a href="https://twitter.com/visenger">@visenger</a>'s talk about 10 fundamental practices for Machine Learning engineering. 
<br><br>Here is my <a href="https://twitter.com/hashtag/sketchnote?src=hash&ref_src=twsrc%5Etfw">#sketchnote</a> <a href="https://twitter.com/hashtag/INNOQTechnologyDay?src=hash&ref_src=twsrc%5Etfw">#INNOQTechnologyDay</a> <a href="https://t.co/tQjRrJq993">pic.twitter.com/tQjRrJq993</a></p>— Joy Heron (@iamjoyheron) <a href="https://twitter.com/iamjoyheron/status/1336698583689596929">December 9, 2020</a></blockquote> <p>This beautiful diagram, made by <a href="https://twitter.com/iamjoyheron" target="_blank" rel="nofollow noopener noreferrer">Joy Heron</a> in response to a talk by <a href="https://twitter.com/visenger" target="_blank" rel="nofollow noopener noreferrer">Dr. Larysa Visengeriyeva</a> about MLOps, is a wonderful encapsulation of the many considerations (at many scales) that go into ML engineering. Do you see DVC in there? 🕵️</p> <p>Thank you for reading, and happy holidays to you! ❄️ 🎁 ☃️</p>https://dvc.org/blog/dvc-vs-rclonehttps://dvc.org/blog/dvc-vs-rcloneThu, 26 Nov 2020 00:00:00 GMT<p>Many general-use tools are available for synchronizing data to and from cloud storage, some widely used options are <a href="https://rsync.samba.org/" target="_blank" rel="nofollow noopener noreferrer">rsync</a>, <a href="https://rclone.org/" target="_blank" rel="nofollow noopener noreferrer">rclone</a> and <a href="https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html" target="_blank" rel="nofollow noopener noreferrer">aws sync</a>, each with their own advantages and disadvantages. 
Likewise, in <a href="https://dvc.org/">DVC</a> we provide the ability to efficiently sync versioned datasets to and from cloud storage through a git-like push and pull <a href="https://dvc.org/doc/start/data-management/data-versioning" target="_blank" rel="nofollow noopener noreferrer">interface</a>.</p> <p>Given that transferring data over a network to and from cloud storage is an inherently slow operation, it's important for data sync tools to optimize performance wherever possible. While the data transfer itself may be the most apparent performance bottleneck in the data sync process, <strong>here we'll cover a less obvious performance issue: How to determine which files to upload and download.</strong></p> <p>In this post, we'll outline the general methods used to solve this problem, and investigate each method's effects on performance by comparing benchmark results from DVC and rclone. We'll then conclude with a more in-depth explanation of new optimizations made in DVC 1.0 which enabled us to outperform both older DVC releases as well as general data sync tools (like rclone).</p> <p><em>Note: "Cloud storage" and "remote storage" will be used interchangeably throughout this post. 
When discussing dataset size in this post, we mean size in terms of total number of files in a dataset, rather than the total amount of file data (bytes).</em></p> <h3 id="outline" style="position:relative;">Outline<a href="#outline" aria-label="outline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li><a href="#why-a-trivial-problem-has-a-not-so-trivial-performance-impact">Why a "trivial" problem has a not-so-trivial performance impact</a></li> <li><a href="#real-world-numbers---dvc-and-rclone-performance-examples">Real-world numbers - DVC and rclone performance examples</a></li> <li><a href="#how-dvc-10-speeds-things-up">How DVC 1.0 speeds things up</a></li> <li><a href="#conclusion">Conclusion</a></li> </ul> <h2 id="why-a-trivial-problem-has-a-not-so-trivial-performance-impact" style="position:relative;">Why a "trivial" problem has a not-so-trivial performance impact<a href="#why-a-trivial-problem-has-a-not-so-trivial-performance-impact" aria-label="why a trivial problem has a not so trivial performance impact permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>At the start of any data sync operation, we must first do the following 
steps, in order to determine which files to upload and download between the local machine and cloud storage:</p> <ol> <li>Determine which files are present locally.</li> <li>Query the cloud storage API to determine which files are present in the cloud.</li> <li>Compute the difference between the two sets of files.</li> </ol> <p>Once this difference in file status has been determined, the necessary files can be copied to or from cloud storage as needed ("file status" meaning file existence as well as other potential status information, such as modification time). <strong>While this may seem like a trivial problem, the second step is actually a significant potential performance bottleneck.</strong></p> <p>In general, cloud storage APIs provide two possible ways to determine what files are present in cloud storage, and it's up to the data sync tool to select which method to use. Even for an operation as simple as synchronizing a single local file to cloud storage, choosing incorrectly between these two options could actually mean the difference between that "simple" operation taking several hours to complete instead of just a few seconds.</p> <p><em>Note: The term "file status query" will be used throughout this post when referring to this type of cloud storage API query.</em></p> <h3 id="method-1-query-individual-files" style="position:relative;">Method 1: Query individual files<a href="#method-1-query-individual-files" aria-label="method 1 query individual files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The first query method is to individually 
check whether or not particular files exist in cloud storage, one at a time.</p> <p><em>Ex: The S3 API provides the <code>HeadObject</code> method.</em></p> <p>When using this method, performance depends on the number of files being queried: a single file takes a single API request, while 1 million files take 1 million API requests. In this case, the overall amount of time it will take to complete the full operation will scale with the number of files to query.</p> <p>One particular advantage to using this method is that it can be easily parallelized. Overall runtime can be improved by making simultaneous API requests to query for multiple files at once.</p> <h3 id="method-2-query-full-remote-listing" style="position:relative;">Method 2: Query full remote listing<a href="#method-2-query-full-remote-listing" aria-label="method 2 query full remote listing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The second query method is to request the full listing of files present in cloud storage, all at once.</p> <p><em>Ex: The S3 API provides the <code>ListObjects</code> method.</em></p> <p>With this method, the overall amount of time it will take to complete the full operation scales with the total number of files in cloud storage, rather than the number of files we wish to query.</p> <p>It's important to note that when using this method, cloud APIs will only return a certain number of files at a time (the amount returned varies depending on the API). 
This means that for an API which returns 1000 files at a time (such as S3), retrieving the full listing of a remote containing 1000 files or fewer would only take a single API request. Listing a remote which contains 1 million files would take 1000 API requests.</p> <p>Another important note is that API calls for this method must be made sequentially and can't be easily parallelized. Using S3 as an example, the first API call would return files 0 through 999. The next call would return files 1000 through 1999, and so on. However, the API provides no guarantee of ordering, and API calls must be made sequentially, until the full list has been retrieved. So we can't make two simultaneous requests for both "files 0-999" and "files 1000-1999".</p> <h3 id="how-selecting-one-method-or-the-other-can-drastically-improve-performance" style="position:relative;">How selecting one method or the other can drastically improve performance<a href="#how-selecting-one-method-or-the-other-can-drastically-improve-performance" aria-label="how selecting one method or the other can drastically improve performance permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Consider an example scenario where a dataset being synchronized contains 100 local files, and we need to check which of those files exist in cloud storage. For the purposes of this example, we'll also assume that all individual API calls take the same amount of time to complete, and that we are not running any tasks in parallel. 
Additionally, let's say that our example cloud storage API returns 1000 files per page when using query method 2.</p> <p>In this situation, we know that the first query method will always take a fixed number of API calls to complete (100). The number of API calls required for the second query method depends on the total number of files that already exist in the remote.</p> <p>Since we know that the API returns 1000 results per API call, we can say that if the remote contains fewer than <code>1000 * 100 = 100,000</code> files, fetching the full remote listing (method 2) will be faster than checking each file individually, since it will take fewer than 100 API calls to complete. In the case that the remote contains 1000 or fewer files, method 2 would only require a single API call (potentially outperforming method 1 by 100x).</p> <p>However, if the remote contains anything over this 100,000 threshold, method 1 will be faster than method 2, with the difference in performance between the two methods scaling linearly as the potential remote size increases.</p> <p><strong>Total API calls required to query 100 local files from S3</strong> <img src="https://dvc.org/2020-11-26/api_calls_100_local-72e1167532070d287193c1edc06d31ec.svg" alt="API calls" title="API calls required to query 100 local files from S3"></p> <p>This example illustrates an important point. Given a (relatively) small set of files to query and a sufficiently large remote, method 1 will always be faster than method 2.</p> <p>Thinking about it from a different perspective, what happens if we have the ability to reduce the size of a (relatively) large query set?</p> <p>Once our query set is smaller than a certain threshold, we'll be able to use method 1 rather than method 2. On top of that, we know that the runtime of method 1 scales with query set size. 
<strong>In simple terms, by reducing the size of our query set as much as possible, we can also improve performance.</strong></p> <p>So, as we have shown, choosing the optimal method depends on both:</p> <ul> <li>The number of files that we need to query.</li> <li>The total number of files in the remote.</li> </ul> <p><em>Note: In terms of real world performance, there are other considerations that DVC must account for, such as different API calls taking different amounts of time to complete, parallelization, and the amount of time it takes to run list comparison operations in Python.</em></p> <h2 id="real-world-numbers---dvc-and-rclone-performance-examples" style="position:relative;">Real-world numbers - DVC and rclone performance examples<a href="#real-world-numbers---dvc-and-rclone-performance-examples" aria-label="real world numbers dvc and rclone performance examples permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now let's take a look at some real-world numbers to examine the impact selecting one query method or the other has on data sync performance in DVC and rclone. 
Both tools can utilize either potential query method, with some differences:</p> <ul> <li>In rclone, the user can specify the <code>--no-traverse</code> option to select the first query method, otherwise rclone will default to the second method in most situations (with the exception being cases with very small query set sizes).</li> <li>In DVC prior to 1.0, the first query method would be used by default for all supported cloud storage platforms except Google Drive, and the user could specify one method or the other via the <code>no_traverse</code> configuration option.</li> <li><strong>In DVC 1.0 and later, the optimal query method is selected automatically.</strong></li> </ul> <p>In the following scenarios, we are simulating the typical DVC use case in which a user tracks a local directory containing some number of files using DVC, and then synchronizes the DVC-tracked directory to cloud storage (S3 in these examples) using either DVC or rclone. The user would then continually repeat a process of:</p> <ol> <li>Modify a small subset of files in the directory.</li> <li>Push the updated version of the directory into cloud storage.</li> </ol> <p>Keep in mind that for DVC's purposes, we are most interested in optimizing performance for scenarios which are normally very slow to complete. If you consider an operation which previously took several hours to complete, improving that runtime down to a few minutes will have a much greater impact for our users versus shaving a few seconds off of an operation which previously took under a minute to run.</p> <p><em>Note: For these benchmarks we are only interested in the amount of time required to determine file status for this one-way push operation. So the runtimes in each case are for status queries only (using <a href="https://dvc.org/doc/command-reference/status#-c"><code>dvc status -c</code></a> in DVC and <code>rclone copy --dry-run</code> in rclone). 
No file data was transferred to or from S3 in any of these scenarios.</em></p> <p><em>Benchmark command usage:</em></p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">time</span> dvc status <span class="token parameter variable">-c</span> <span class="token parameter variable">-r</span> remote </span><span class="token line"><span class="token input">$ </span><span class="token command">time</span> rclone copy <span class="token parameter variable">--dry-run</span> <span class="token parameter variable">--progress</span> <span class="token parameter variable">--exclude</span> <span class="token string">"**/**.unpacked/"</span> .dvc/cache remote:<span class="token punctuation">..</span>.</span></code></pre></div> <p><em>rclone run with <code>--no-traverse</code> where indicated</em></p> <p><em>Benchmark platform: Python 3.7, macOS Catalina, DVC installed from pip, dual-core 3.1GHz i7 cpu</em></p> <p><strong>Local directory w/100k total files, S3 bucket w/1M total files (1 file modified since last sync)</strong> <img src="https://dvc.org/2020-11-26/dvc_rclone_bench-71a153aa67b33f2de5c350dab7dbebd3.svg" alt="benchmarks" title="DVC 1.0 vs rclone performance comparison"></p> <p>The previous chart contains benchmarks for a scenario in which the local directory contains 100,000 files, and the S3 bucket contains approximately 1 million files. One file in the local directory has been modified since the directory was last synchronized with the S3 bucket. 
This scenario tests the length of time it takes DVC or rclone to determine (and report to the user) that only the one modified file is missing from the S3 bucket and needs to be uploaded.</p> <p>This illustrates DVC's performance advantage over rclone with regard to synchronizing iterations of a versioned dataset over time, as well as the DVC 1.0 performance improvements over prior releases.</p> <p><em>Note: In these examples, the local file count refers to the number of files inside the original tracked directory. The number of files present in the DVC cache will differ slightly, since the DVC cache will contain an additional file representing the tracked directory itself, but the end result is that DVC and rclone both need to query for the same number of files (i.e. the number of files in the cache directory).</em></p> <p><strong>Local directory w/1 file, S3 bucket w/1M total files</strong> <img src="https://dvc.org/2020-11-26/dvc_rclone_bench2-1ec9a63c6674ee11a5147f15958608d8.svg" alt="benchmarks" title="DVC 1.0 vs rclone performance comparison"></p> <p>In this example, we are testing a simple scenario in which the local directory contains 1 file and the S3 bucket contains approximately 1 million files.</p> <p>In this case, in DVC 0.91 we essentially get lucky that our default choice for S3 happens to be the first query method. If we ran this same scenario with a Google Drive remote (where the 0.91 default choice is the second query method) instead of S3, we would see a very long runtime for DVC 0.91.</p> <p>Also note that here, rclone is able to determine that with a single local file to query, it should use the first query method instead of defaulting to the second method.</p> <p><em>Note: We are unsure of the reason for the rclone runtime difference with and without <code>--no-traverse</code> for this scenario, but rclone does do some computation to determine whether or not to default to <code>no-traverse</code> behavior for small query sets. 
It's likely that specifying <code>--no-traverse</code> allows rclone to skip that overhead entirely in this case.</em></p> <p><strong>Local directory w/1M files, Empty S3 bucket</strong> <img src="https://dvc.org/2020-11-26/dvc_rclone_bench3-ae6c58603cf1aa93382fcdcdbff9ec4b.svg" alt="benchmarks" title="DVC 1.0 vs rclone performance comparison"> <em>Note: DVC 0.91 and rclone with <code>--no-traverse</code> both take multiple hours to complete in this scenario and continue off of the chart.</em></p> <p>In this example, we are testing a simple scenario in which the local directory contains approximately 1 million files and the S3 bucket is empty.</p> <p>The difference in rclone runtime with or without <code>--no-traverse</code> in this scenario shows the performance impact of selecting the optimal query method for a given situation.</p> <p>This scenario also shows that rclone can outperform DVC with regard to collecting the list of local files during certain types of sync operations. In this case, rclone simply iterates over whatever files exist in the local directory without doing any additional steps, since our benchmark uses a one-way <code>rclone copy</code> operation.</p> <p>However, in DVC, we have some extra overhead for this step, since we collect the list of files expected to be present in the current DVC repository revision, and then verify that those files are present locally. We would then check to see if any missing files are available to be downloaded from remote storage.</p> <p>It should also be noted that in common use cases where the number of files in cloud storage continues to grow over time (such as in backup solutions or in dataset versioning), rclone's advantage in this case would only apply for this initial sync operation. 
Once the local dataset has been pushed to cloud storage, DVC's advantage in synchronizing modifications to existing datasets would become more apparent (as shown in the first example).</p> <h2 id="how-dvc-10-speeds-things-up" style="position:relative;">How DVC 1.0 speeds things up<a href="#how-dvc-10-speeds-things-up" aria-label="how dvc 10 speeds things up permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>So I hope that by now you're curious about DVC, and are planning on using (or maybe even already are using 😀) it to sync your files. For those who are wondering where the magic actually happens, let's dive a bit deeper into how DVC stores files, and how we were able to leverage that storage format to implement query performance optimizations in DVC 1.0. 
(This will also be a useful primer for anyone interested in learning about DVC internals in general.)</p> <p>Previously, we have established that:</p> <ul> <li>Selecting the right query method will have a significant performance impact.</li> <li>Reducing the number of files to query will improve performance.</li> </ul> <p>In this section, we'll cover the ways in which DVC 1.0 has directly addressed both of these key points:</p> <ul> <li>Automatically selecting the optimal query method for any given sync operation.</li> <li>Indexing cloud storage remotes to eliminate the need to query for already synchronized files.</li> </ul> <h3 id="dvc-storage-structure" style="position:relative;">DVC storage structure<a href="#dvc-storage-structure" aria-label="dvc storage structure permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Before continuing, it will be helpful for the reader to understand a few things about the DVC cache and remote storage structure.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">. ├── 00 │ ├── 411460f7c92d2124a67ea0f4cb5f85 │ ├── 6f52e9102a8d3be2fe5614f42ba989 │ └── ... ├── 01 ├── 02 ├── 03 ├── ... └── ff</code></pre></div> <p><em>Example DVC cache/remote structure</em></p> <ul> <li>Files versioned by DVC are identified and stored in subdirectories according to their <a href="https://en.wikipedia.org/wiki/MD5" target="_blank" rel="nofollow noopener noreferrer">MD5</a> hash (i.e. 
<a href="https://en.wikipedia.org/wiki/Content-addressable_storage" target="_blank" rel="nofollow noopener noreferrer">content addressable storage</a>).</li> <li>MD5 is an <a href="https://michiel.buddingh.eu/distribution-of-hash-values" target="_blank" rel="nofollow noopener noreferrer">evenly distributed</a> hash function, so the DVC cache (and DVC remote storage) will be evenly distributed (i.e. given a large enough dataset, each remote subdirectory will contain an approximately equal number of files)</li> </ul> <h3 id="how-dvc-10-automatically-selects-a-query-method" style="position:relative;">How DVC 1.0 automatically selects a query method<a href="#how-dvc-10-automatically-selects-a-query-method" aria-label="how dvc 10 automatically selects a query method permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In DVC, the number of files we need to query is just the number of files for a given project revision. So, as long as we can estimate the number of files in a DVC remote, we can programmatically choose the optimal query method for a remote operation.</p> <p>In DVC 1.0, we accomplish this by taking advantage of the DVC remote structure. The over/under remote size threshold only depends on the number of files being queried (i.e. the number of files in our DVC versioned dataset). And as we have already established, a DVC remote will be evenly distributed. 
Therefore, if we know the number of files contained in a subset of the remote, we can then estimate the number of files contained in the entire remote.</p> <p>For example, if we know that the remote subdirectory <code>00/</code> contains 10 files, we can estimate that the remote contains roughly <code>256 * 10 = 2,560</code> files in total. So, by requesting a list of one subdirectory at a time (rather than the full remote) via the cloud storage API, we can calculate a running estimate of the total remote size. If the running estimated total size goes over the threshold value, DVC will stop fetching the contents of the remote subdirectory, and switch to querying each file in our dataset individually. If DVC reaches the end of the subdirectory without the estimated size going over the threshold, it will continue to fetch the full listing for the rest of the remote.</p> <p>By estimating remote size in DVC 1.0, we can ensure that we always use the optimal method when querying remote status.</p> <h3 id="how-dvc-10-uses-indices-to-reduce-the-number-of-files-to-query" style="position:relative;">How DVC 1.0 uses indices to reduce the number of files to query<a href="#how-dvc-10-uses-indices-to-reduce-the-number-of-files-to-query" aria-label="how dvc 10 uses indices to reduce the number of files to query permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>A common DVC use case is <a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files" target="_blank" rel="nofollow noopener noreferrer">versioning</a> the contents of a large directory. 
As the contents of the directory change over time, DVC will be used to push each updated version of the directory into cloud storage. In many cases, only a small number of files within that directory will be modified between project iterations.</p> <p>So after the first version of a project is pushed into cloud storage, for subsequent versions, only the small subset of changed files actually needs to be synchronized with cloud storage.</p> <p>Consider a case where a user has an existing directory with 1 million files which has been versioned and pushed to a remote with DVC. In the next iteration of the project, only a single file in the directory has been modified. We can obviously see that everything other than the one modified file will already exist in cloud storage. Ideally, we should only need to query for the single modified file.</p> <p>However, in DVC releases prior to 1.0, DVC would always need to query for every file in the directory, regardless of whether or not a given file had changed since the last time it was pushed to remote storage.</p> <p>But in DVC 1.0, we now keep an index of directories which have already been versioned and pushed into remote storage. By referencing this index, DVC will "remember" which files already exist in a remote, and will remove them from our query set at the start of a data sync operation (before we choose a query method, and before we make any cloud storage API requests).</p> <p><em>Note: This optimization only applies to DVC versioned directories. 
Individually versioned files (including those added with <a href="https://dvc.org/doc/command-reference/add#-R"><code>dvc add -R</code></a>) are not indexed in DVC 1.0, and will always be queried during remote operations.</em></p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>By utilizing a storage structure that allows for optimized status queries, DVC makes data synchronization incredibly fast. Coupled with the ability to quickly identify which files remain unchanged between sync operations, DVC 1.0 is a powerful data management tool.</p> <p>Whether you are upgrading from a prior DVC release, or trying DVC for the first time, we hope that all of our users are able to benefit from these new optimizations. 
DVC performance is an important issue, and our team is looking forward to working on further <a href="https://github.com/iterative/dvc/labels/performance" target="_blank" rel="nofollow noopener noreferrer">performance optimizations</a> in the future - across all areas in DVC, not just remotes.</p> <p>As always, if you have any questions, comments or suggestions regarding DVC performance, please feel free to connect with the DVC community on <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Discourse</a>, <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord</a> and <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>.</p>https://dvc.org/blog/november-20-community-gemshttps://dvc.org/blog/november-20-community-gemsWed, 25 Nov 2020 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-if-i-checkout-a-different-git-branch-how-do-i-synchronize-with-dvc" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/773498570795778058" target="_blank" rel="nofollow noopener noreferrer">Q: If I checkout a different Git branch, how do I synchronize with DVC?</a><a href="#q-if-i-checkout-a-different-git-branch-how-do-i-synchronize-with-dvc" aria-label="q if i checkout a different git branch how do i synchronize with dvc permalink" class="anchor after"><svg 
aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Here's what we recommend: when you checkout a different Git branch in your project:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> <span class="token parameter variable">-b</span> <span class="token operator"><</span>my_great_new_branch<span class="token operator">></span></span></code></pre></div> <p>you'll want to next run</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span></span></code></pre></div> <p>to synchronize your <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files on that branch. But <em>did you know</em> you can automate this with a <code>post-checkout</code> Git hook? We've got a hook that executes <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> whenever you run <code>git checkout</code>, so you'll always have the correct data file versions. 
Head to our docs to <a href="https://dvc.org/doc/command-reference/install#install" target="_blank" rel="nofollow noopener noreferrer">read up on installing Git hooks into your DVC repository</a> so you never forget to <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>!</p> <h3 id="q-i-have-a-big-100-gb-directory-i-want-to-know-where-the-contents-are-located-so-i-can-open-them-with-spark--is-there-a-way-to-get-the-location-of-my-files-without-caching-them-locally" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/771386223403073587" target="_blank" rel="nofollow noopener noreferrer">Q: I have a big, 100 GB directory. I want to know where the contents are located so I can open them with Spark- is there a way to get the location of my files without caching them locally?</a><a href="#q-i-have-a-big-100-gb-directory-i-want-to-know-where-the-contents-are-located-so-i-can-open-them-with-spark--is-there-a-way-to-get-the-location-of-my-files-without-caching-them-locally" aria-label="q i have a big 100 gb directory i want to know where the contents are located so i can open them with spark is there a way to get the location of my files without caching them locally permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>For this, we'd recommend the <a href="https://dvc.org/doc/api-reference/get_url#dvcapiget_url" target="_blank" rel="nofollow noopener noreferrer">DVC Python API</a>'s <code>get_url</code> function. 
For example, in a Python script you'd write:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> dvc<span class="token punctuation">.</span>api

resource_url <span class="token operator">=</span> dvc<span class="token punctuation">.</span>api<span class="token punctuation">.</span>get_url<span class="token punctuation">(</span>
    <span class="token string">"<top-level-directory>"</span><span class="token punctuation">,</span>
    repo<span class="token operator">=</span><span class="token string">"https://github.com/<your-repo>"</span>
<span class="token punctuation">)</span></code></pre></div> <p>With this code, the API returns the URL of a file that ends in <code>.dir</code>. The <code>.dir</code> file contains a JSON-formatted table of the hashes and relative paths for all the files inside <code><top-level-directory></code>. You can then parse that file to get the relative paths to the files in your remote storage.</p> <p>The JSON object will look something like this, for a file <code>foo/bar</code> in your project:</p> <div class="gatsby-highlight" data-language="json"><pre class="language-json"><code class="language-json"><span class="token punctuation">{</span> <span class="token property">"md5"</span><span class="token operator">:</span> <span class="token string">"abcd123"</span><span class="token punctuation">,</span> <span class="token property">"relpath"</span><span class="token operator">:</span> <span class="token string">"foo/bar"</span> <span class="token punctuation">}</span></code></pre></div> <p>Then you can convert the hash for <code>foo/bar</code> into an absolute path in your remote: the first two characters of the hash form the directory name, and the remaining characters form the file name:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">https://<path-to-your-remote-storage>/ab/cd123</code></pre></div> <p>To better understand how DVC uses <a 
href="https://en.wikipedia.org/wiki/Content-addressable_storage" target="_blank" rel="nofollow noopener noreferrer">content-addressable storage</a> in your remote, <a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-the-cache-directory" target="_blank" rel="nofollow noopener noreferrer">read up in our docs</a>.</p> <h3 id="q-can-i-have-more-than-one-dvcyaml-file-in-my-project" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/777946398250893333" target="_blank" rel="nofollow noopener noreferrer">Q: Can I have more than one <code>dvc.yaml</code> file in my project?</a><a href="#q-can-i-have-more-than-one-dvcyaml-file-in-my-project" aria-label="q can i have more than one dvcyaml file in my project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>By default, a DVC pipeline records all your stages (and their inputs and outputs) in a single file, <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. Per directory, you can have one <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. 
If you want to run pipelines in a different folder than your project root, you could create another <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> in a subdirectory.</p> <p>However, <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> is intended to be the only file you need to record and reproduce pipelines per directory. Pipelines are designed to have all stages stored in the same place, and there's currently no method to rename <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>.</p> <h3 id="q-how-can-i-untrack-a-file-thats-being-tracked-by-dvc-i-want-to-remove-it-from-remote-storage-and-my-local-cache-too" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/773277514717462548" target="_blank" rel="nofollow noopener noreferrer">Q: How can I untrack a file that's being tracked by DVC? I want to remove it from remote storage and my local cache, too.</a><a href="#q-how-can-i-untrack-a-file-thats-being-tracked-by-dvc-i-want-to-remove-it-from-remote-storage-and-my-local-cache-too" aria-label="q how can i untrack a file thats being tracked by dvc i want to remove it from remote storage and my local cache too permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you want to untrack a file, perhaps something you added to DVC in error, you can use <a href="https://dvc.org/doc/command-reference/remove"><code>dvc remove</code></a> to get rid of the <a 
href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file corresponding to your file, and then clear your DVC cache with <a href="https://dvc.org/doc/command-reference/gc#-w"><code>dvc gc -w --cloud</code></a>. <a href="https://dvc.org/doc/user-guide/how-to/stop-tracking-data" target="_blank" rel="nofollow noopener noreferrer">Check out our docs</a> to learn more about <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> and what its flags mean (you'll want to be sure you know what you're doing, since cache cleaning deletes files permanently!).</p> <p>Alternatively, you can manually find and delete your files:</p> <ol> <li>Find the file using its hash from the corresponding <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file (or, if it's part of a pipeline, the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> file).</li> <li>Look in your remote storage and remove the file matching the hash.</li> <li>Look in <code>.dvc/cache</code> and remove the file as well. If you'd like to better understand how your cache is organized, <a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-the-cache-directory" target="_blank" rel="nofollow noopener noreferrer">we have docs for that</a>.</li> </ol> <p>Your DVC remote storage and cache are simply storage locations, so once your file is gone from there it's gone for good.</p> <h3 id="q-my-dvc-cache-is-getting-a-bit-big-can-i-clean-it" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/563406153334128681/771275051382341674" target="_blank" rel="nofollow noopener noreferrer">Q: My DVC cache is getting a bit big. 
Can I clean it?</a><a href="#q-my-dvc-cache-is-getting-a-bit-big-can-i-clean-it" aria-label="q my dvc cache is getting a bit big can i clean it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Definitely. Have you seen the command <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>? It helps you clean your local cache; <a href="https://dvc.org/doc/command-reference/gc" target="_blank" rel="nofollow noopener noreferrer">read up here</a>. This command lets you get granular about what you're keeping; for example, you can instruct <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> to preserve cache files that are currently used in your local workspace, tips of Git branches, tagged Git commits or all Git commits. Everything else will be removed.</p> <p>One word of caution: make sure that when you collect garbage from your cache, you don't delete any files that you haven't yet pushed to a remote. If this happens, you'll delete them permanently. 
To be safe, it never hurts to <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> your files of interest before cleaning.</p> <h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-does-cml-support-bitbucket" style="position:relative;"><a href="https://github.com/iterative/cml/issues/140" target="_blank" rel="nofollow noopener noreferrer">Q: Does CML support Bitbucket?</a><a href="#q-does-cml-support-bitbucket" aria-label="q does cml support bitbucket permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We've just rolled out Bitbucket Cloud support! There are brand new docs in the CML project repo, <a href="https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud" target="_blank" rel="nofollow noopener noreferrer">so check them out</a> to get started. A few quick notes to keep in mind:</p> <ol> <li> <p>Like GitLab, Bitbucket Cloud requires you to create a token for authorizing CML to write comments. Make sure you don't forget this step (it's in the docs!) 
or you'll surely hit a permissions error.</p> </li> <li> <p>Bitbucket Cloud uses Bitbucket Pipelines for continuous integration workflows, which <a href="https://jira.atlassian.com/browse/BCLOUD-16995" target="_blank" rel="nofollow noopener noreferrer">currently doesn't support self-hosted runners</a>. That means <a href="https://community.atlassian.com/t5/Bitbucket-questions/Does-bitbucket-pipe-support-GPUs-yet/qaq-p/1042659" target="_blank" rel="nofollow noopener noreferrer">bringing your own GPUs is not supported</a>. Sorry! But you can still have all the other CML benefits of plots, tables and text in your Pull Request.</p> </li> <li> <p>Bitbucket Server support (with Jenkins and Bamboo) is under active development. Stay tuned!</p> </li> </ol> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ae915fa598568bd5e8ca33c3922d398d/39600/bitbucket_cloud_pr.png" alt="bitbucket cloud pr" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Now your Bitbucket PRs can be as pretty as you.</em></p> <h3 id="q-can-i-use-cml-with-windows-runners" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/772519007894765600" target="_blank" rel="nofollow noopener noreferrer">Q: Can I use CML with Windows runners?</a><a href="#q-can-i-use-cml-with-windows-runners" aria-label="q can i use cml with windows runners permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 
13 6z"></path></svg> </a></h3> <p>While all our CML tutorials and docs use Ubuntu runners of various flavors, there's no problem with using Windows runners. Both <a href="https://docs.github.com/en/free-pro-team@latest/actions/reference/specifications-for-github-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions</a> and <a href="https://about.gitlab.com/blog/2020/01/21/windows-shared-runner-beta/" target="_blank" rel="nofollow noopener noreferrer">GitLab CI</a> have Windows runners up for grabs. And of course, you can set up your own Windows machine as a self-hosted runner (see the self-hosted runner docs for your CI system to learn more).</p> <p>What if you have a GPU? If you want to use <a href="https://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpus" target="_blank" rel="nofollow noopener noreferrer"><code>nvidia-docker</code> to put GPU drivers in your container</a>, you'll want to use <code>nvidia-docker</code> with the Windows Subsystem for Linux (WSL). That means you'll first install an Ubuntu subsystem on your Windows machine, then all your Nvidia drivers, then Docker and <code>nvidia-docker</code>. Check out some <a href="https://docs.nvidia.com/cuda/wsl-user-guide/index.html" target="_blank" rel="nofollow noopener noreferrer">more docs about CUDA with WSL</a> to learn more.</p> <h3 id="q-im-using-cml-to-deploy-a-self-hosted-runner-with-gitlab-i-noticed-that-in-your-docs-the-runner-is-always-set-to-timeout-after-1800-seconds-and-then-it-gets-unregistered-from-gitlab-what-if-i-want-to-keep-my-runner-registered-after-the-job-ends" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/728693131557732403/779317571354099722" target="_blank" rel="nofollow noopener noreferrer">Q: I'm using CML to deploy a self-hosted runner with GitLab. I noticed that in your docs, the runner is always set to timeout after 1800 seconds, and then it gets unregistered from GitLab. 
What if I want to keep my runner registered after the job ends?</a><a href="#q-im-using-cml-to-deploy-a-self-hosted-runner-with-gitlab-i-noticed-that-in-your-docs-the-runner-is-always-set-to-timeout-after-1800-seconds-and-then-it-gets-unregistered-from-gitlab-what-if-i-want-to-keep-my-runner-registered-after-the-job-ends" aria-label="q im using cml to deploy a self hosted runner with gitlab i noticed that in your docs the runner is always set to timeout after 1800 seconds and then it gets unregistered from gitlab what if i want to keep my runner registered after the job ends permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>With CML, we introduced an approach using Docker Machine to provision instances in the cloud, and then use <code>docker run</code> to register them as self-hosted runners to complete your workflow. 
As this question points out, we like to set runners to time out after 1800 seconds of idling; that's why you'll see this code in our <a href="https://github.com/iterative/cml_cloud_case/blob/master/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">sample "Cloud GPU" workflow</a>:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">docker</span> run <span class="token parameter variable">--name</span> myrunner <span class="token parameter variable">-d</span> <span class="token parameter variable">--gpus</span> all <span class="token punctuation">\</span>
    <span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_IDLE_TIMEOUT</span><span class="token operator">=</span><span class="token number">1800</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_LABELS</span><span class="token operator">=</span>cml,gpu <span class="token punctuation">\</span>
    <span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_REPO</span><span class="token operator">=</span><span class="token variable">$CI_SERVER_URL</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">-e</span> <span class="token assign-left variable">repo_token</span><span class="token operator">=</span><span class="token variable">$REGISTRATION_TOKEN</span> <span class="token punctuation">\</span>
    <span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_DRIVER</span><span class="token operator">=</span>gitlab <span class="token punctuation">\</span>
    iterativeai/cml:0-dvc2-base1-gpu runner</span></code></pre></div> <p>We did this so you'll avoid running up GPU hours and a big bill. 
If you're not worried about that, though, you can set the environment variable <code>RUNNER_IDLE_TIMEOUT</code> in the <code>dvcorg/cml</code> container to 0. Then, your self-hosted runner will stay on forever, or at least until you manually turn it off.</p> <p>By the way… stay tuned for a big update here. We're currently replacing the Docker Machine approach with a method based on Terraform, and we can't wait to unveil it. It should make deploying cloud instances on AWS, GCP and Azure work with less code than ever.</p> <h3 id="q-what-did-deevee-do-for-thanksgiving" style="position:relative;">Q: What did DeeVee do for Thanksgiving?<a href="#q-what-did-deevee-do-for-thanksgiving" aria-label="q what did deevee do for thanksgiving permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>She stayed home and made mashed potatoes.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/252ccf20ce3c7a53778c4d2a07c2a99e/39600/deevee_n_taters.png" alt="deevee n taters" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>That's all for now, everyone! 
As always, keep in touch with all your questions big and small.</p>https://dvc.org/blog/november-20-dvc-heartbeathttps://dvc.org/blog/november-20-dvc-heartbeatWed, 11 Nov 2020 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Welcome to the November Heartbeat! Let's dive in with some news from the team.</p> <h3 id="datacouncil-interviews-dmitry" style="position:relative;">DataCouncil interviews Dmitry<a href="#datacouncil-interviews-dmitry" aria-label="datacouncil interviews dmitry permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/DataCouncilAI" target="_blank" rel="nofollow noopener noreferrer">Data Council</a>'s <a href="https://twitter.com/petesoder?lang=en" target="_blank" rel="nofollow noopener noreferrer">Peter Soderling</a> interviewed CEO Dmitry! 
Check out the recording from Data Council's live event, including Q&A from the Data Council community, on YouTube.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/8dBCgIa7TGE?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="were-hiring" style="position:relative;">We're hiring<a href="#were-hiring" aria-label="were hiring permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Did you know we're hiring for two roles in our growing team? 
We're looking for:</p> <ul> <li> <p>A <a href="https://weworkremotely.com/remote-jobs/iterative-senior-software-engineer-open-source-dev-tools-3" target="_blank" rel="nofollow noopener noreferrer"><strong>Senior Software Engineer</strong></a> for the core DVC team: someone with strong Python development skills who can build and ship essential DVC features.</p> </li> <li> <p>A <a href="https://weworkremotely.com/remote-jobs/iterative-developer-advocate" target="_blank" rel="nofollow noopener noreferrer"><strong>Developer Advocate</strong></a> to lead the community, support contributors and new users, and create new content like blogs and videos about DVC and CML.</p> </li> </ul> <p>Here are a few reasons to consider joining us:</p> <ul> <li>Your work will be visible and will be used by thousands of developers every day!</li> <li>We're a small, fully remote team. Work from anywhere!</li> <li>Competitive salary and benefits</li> <li>Family-friendly benefits, including unlimited PTO</li> </ul> <p>If you're interested, we'd love to hear from you about either role (and we welcome referrals if you know a good candidate)!</p> <h3 id="new-videos" style="position:relative;">New videos<a href="#new-videos" aria-label="new videos permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We're continuing to develop our video docs, and now half of our "Getting Started" section has video accompaniments. 
Check out our latest release on <a href="https://dvc.org/doc/start/data-and-model-access" target="_blank" rel="nofollow noopener noreferrer">data access with DVC</a>:</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/EE7Gk84OZY8?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>This video covers functions like <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a>, <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>, and the DVC Python API.</p> <p>We took a quick break from releasing videos during the US election week, but look out for a new video on our <a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube channel</a> about model testing with continuous integration! 
Subscribe to get alerts whenever we have something new :)</p> <h3 id="workshops-and-conferences" style="position:relative;">Workshops and conferences<a href="#workshops-and-conferences" aria-label="workshops and conferences permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>As usual, there are plenty of remote meetings on our schedules:</p> <ul> <li> <p><a href="http://www.bootcamp.dadosesaude.com/" target="_blank" rel="nofollow noopener noreferrer">HealthData Bootcamp</a> is a weeklong intensive for all things biomedical data science. Dmitry and I (Elle), plus DVC Ambassadors Mikhail Rozhkov and Marcel Ribeiro-Dantas, will be presenting lectures and workshops about MLOps throughout the week!</p> </li> <li> <p>I'll be leading a hands-on workshop at the <a href="https://torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Toronto Machine Learning Society Annual Meeting</a>. It'll cover how to get started using <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">Continuous Machine Learning</a> (CML) with GitHub Actions. <a href="https://torontomachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Register here</a>, and be sure to reserve your spot in the workshop.</p> </li> <li> <p>This week, I have another talk at <a href="https://global.pydata.org/" target="_blank" rel="nofollow noopener noreferrer">PyData Global</a> about CML. 
PyData Global is online for the first time ever and promises to be a great gathering for Python-using data scientists in industry and academic research alike.</p> </li> </ul> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here are some of our favorite happenings around the MLOps community this week.</p> <h3 id="a-new-online-course" style="position:relative;">A new online course<a href="#a-new-online-course" aria-label="a new online course permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/GokuMohandas" target="_blank" rel="nofollow noopener noreferrer">Goku Mohandas</a>, founder of <a href="https://twitter.com/madewithml" target="_blank" rel="nofollow noopener noreferrer">Made with ML</a>, announced plans to release a new online course about putting ML in production. The curriculum will cover everything from experiment tracking to deploying and monitoring models in production, and you can expect DVC to be included! 
Keep an eye on Goku and Made with ML on Twitter for updates.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🔥 Putting ML in Production! We're going to publicly develop <a href="https://twitter.com/MadeWithML">@madewithml</a>'s first ML service. Here is the broad curriculum: <br><br>- 📦 Product<br>- 🔢 Data<br>- 🤖 Modeling<br>- 📝 Scripting<br>- 🛠 API<br>- 🚀 Production<br><br>More details (lessons, task, etc.) here: <a href="https://t.co/xmMm9XGK9j">https://t.co/xmMm9XGK9j</a><br><br>Thread 👇 <a href="https://t.co/T0uLPb2QbR">pic.twitter.com/T0uLPb2QbR</a></p>— Goku Mohandas (@GokuMohandas) <a href="https://twitter.com/GokuMohandas/status/1315990996849627136">October 13, 2020</a></blockquote> <h3 id="our-favorite-blogs" style="position:relative;">Our favorite blogs<a href="#our-favorite-blogs" aria-label="our favorite blogs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/visenger" target="_blank" rel="nofollow noopener noreferrer">Dr. 
Larysa Visengeriyeva</a>, creator of the top-notch <a href="https://github.com/visenger/awesome-mlops" target="_blank" rel="nofollow noopener noreferrer">"Awesome MLOps" GitHub repo</a>, and DevOps expert Anja Kammer wrote a must-read essay about CI/CD for ML (note: it's published in German; I used Chrome's built-in translation to read in English).</p> <p>The blog covers key concepts like continuous integration, deployment, and training with ML, as well as practical approaches and sample architectures.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.innoq.com/de/articles/2020/10/mlops-operations-fuer-machine-learning/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">MLOps: You Train It, You Run It!</h4> <div class="elp-description">CI / CD & Operations for machine learning</div> <div class="elp-link">innoq.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-11-08/innoq-35328b26bf404a8d5892cea6cae83fb3.png" alt="MLOps: You Train It, You Run It!"> </div> </a> </section> <p></p> <p><em>Also</em>, there's some cool art.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b38952332c3d2bbd69dbeb9cf47fa685/39600/mlops_diagram.png" alt="mlops diagram" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Another blog on our radar: <a href="https://twitter.com/lopp_sean" target="_blank" rel="nofollow noopener noreferrer">Sean Lopp</a> at <a href="https://twitter.com/rstudio" target="_blank" rel="nofollow noopener noreferrer">RStudio</a> made the first known blog about a CML report with a ggplot! 
Using RStudio's <a href="https://github.com/r-lib/actions" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions for R</a> and CML, Sean built a sample data science workflow that runs automatically in GitHub Actions on a push. He reports on some pros, cons, and areas for future development to make R language data science easy to automate.</p> <p> </p><section class="elp-content-holder"> <a href="https://loppsided.blog/posts/2020-10-26-tidymodels-dvc-mashup/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Tidymodels DVC Mashup</h4> <div class="elp-description">Using Github Actions and Data Version Control for ModelOps in R</div> <div class="elp-link">loppsided.blog</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-11-08/sean_lopp-6c26e81b394a7c61ebab8e0dc7f00e56.jpg" alt="Tidymodels DVC Mashup"> </div> </a> </section> <p></p> <p>Finally, developer <a href="https://twitter.com/stribny" target="_blank" rel="nofollow noopener noreferrer">Petr Stribny</a> wrote about how to version big files in a Git project with DVC. 
It's a short-and-sweet guide to getting started, and if you're trying to decide if DVC is for you, this is worth a look.</p> <p> </p><section class="elp-content-holder"> <a href="https://stribny.name/blog/2020/10/versioning-large-files-in-git-with-dvc/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Versioning large files in git with DVC</h4> <div class="elp-description">Software development and beyond</div> <div class="elp-link">stribny.name</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-11-08/petr-f92d049b5032e322835f05e74cc215f7.jpg" alt="Versioning large files in git with DVC"> </div> </a> </section> <p></p> <h3 id="a-nice-tweet" style="position:relative;">A nice tweet<a href="#a-nice-tweet" aria-label="a nice tweet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To wrap it up, here's a kind tweet that we really like. 
It's always good to be mentioned in the same tweet as some of our heroes :)</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Companies such as <a href="https://twitter.com/astronomerio">@astronomerio</a>, <a href="https://twitter.com/HashiCorp">@HashiCorp</a>, <a href="https://twitter.com/supabase_io">@supabase_io</a>, <a href="https://twitter.com/Iterativeai">@Iterativeai</a> are excellent examples of companies with a relentless focus on building for developer love.</p>— Ethan Batraski (@ethanjb) <a href="https://twitter.com/ethanjb/status/1316833012676354048">October 15, 2020</a></blockquote> <p>Thanks for reading this month!</p>https://dvc.org/blog/october-20-community-gemshttps://dvc.org/blog/october-20-community-gemsMon, 26 Oct 2020 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-whats-in-a-dvc-file-and-what-would-happen-if-decided-not-push-my-dvc-files-to-my-git-repo" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/760920403064520755" target="_blank" rel="nofollow noopener noreferrer">Q: What's in a <code>.dvc</code> file, and what would happen if decided not push my <code>.dvc</code> files to my Git repo?</a><a href="#q-whats-in-a-dvc-file-and-what-would-happen-if-decided-not-push-my-dvc-files-to-my-git-repo" aria-label="q whats in a dvc file and what would happen if decided not push my dvc files to my git repo permalink" 
class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC creates lightweight metafiles (<a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files) that correspond to large artifacts in your project. These <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files contain pointers to your artifacts in remote storage (we use a simple content-based storage scheme). Because we use content-based storage, the remote storage itself isn't designed for browsing (although <a href="https://github.com/iterative/dvc/issues/3621" target="_blank" rel="nofollow noopener noreferrer">there are some discussions</a> about how to make stored files more "discoverable", and you can always identify them manually by their contents and meta-information like timestamps).</p> <p>Your <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files establish meaningful links between human-readable filenames and file contents in remote storage, and they let you apply Git versioning to your stored datasets and models. 
You can think of your DVC remote storage as a <em>complement</em> to your Git repository, not a replacement.</p> <p>In other words… if you're not Git versioning your <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files, you're not versioning anything in DVC remote storage!</p> <h3 id="q-can-i-limit-the-number-of-network-connections-used-by-dvc-during-dvc-pull" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/739760523293360182" target="_blank" rel="nofollow noopener noreferrer">Q: Can I limit the number of network connections used by DVC during <code>dvc pull</code>?</a><a href="#q-can-i-limit-the-number-of-network-connections-used-by-dvc-during-dvc-pull" aria-label="q can i limit the number of network connections used by dvc during dvc pull permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yep- by default, DVC data transfer operations use a number of threads proportional to the number of CPUs detected. But, there's a handy flag for <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> that lets you override the defaults:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">-j <number>, --jobs <number> - number of threads to run simultaneously to handle the downloading of files from the remote. The default value is 4 * cpu_count(). For SSH remotes, the default is just 4. 
Using more jobs may improve the total download speed if a combination of small and large files are being fetched.</code></pre></div> <h3 id="q-im-working-on-a-multi-class-classification-task-can-dvc-plots-show-multiple-precision-recall-curves--one-for-each-class" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/765117500530491472" target="_blank" rel="nofollow noopener noreferrer">Q: I'm working on a multi-class classification task. Can <code>dvc plots</code> show multiple precision recall curves- one for each class?</a><a href="#q-im-working-on-a-multi-class-classification-task-can-dvc-plots-show-multiple-precision-recall-curves--one-for-each-class" aria-label="q im working on a multi class classification task can dvc plots show multiple precision recall curves one for each class permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Currently, <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots</code></a> doesn't support multiple linear curves on a single plot (except for <a href="https://dvc.org/doc/command-reference/plots/diff"><code>dvc plots diff</code></a>, of course!). But, you could make one precision recall curve per class and display them side-by-side.</p> <p>To do this, you'd want to write the precision recall curve values to separate files for each class (<code>prc-0.json</code>,<code>prc-1.json</code>, etc.). 
Then you would run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots show</span> prc-0.json prc-1.json</span></code></pre></div> <p>And you'll see two plots side-by-side! A benefit of this approach is that when you run <a href="https://dvc.org/doc/command-reference/plots/diff"><code>dvc plots diff</code></a> to compare precision recall curves across Git commits, you'll get a comparison plotted for each class.</p> <h3 id="q-are-you-sure-i-should-commit-my-dvcconfig-file-it-contains-my-logging-credentials-for-storage-and-im-nervous-about-adding-it-to-a-shared-git-repository" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/768770079596740650" target="_blank" rel="nofollow noopener noreferrer">Q: Are you sure I should commit my <code>.dvc/config</code> file? It contains my logging credentials for storage, and I'm nervous about adding it to a shared Git repository.</a><a href="#q-are-you-sure-i-should-commit-my-dvcconfig-file-it-contains-my-logging-credentials-for-storage-and-im-nervous-about-adding-it-to-a-shared-git-repository" aria-label="q are you sure i should commit my dvcconfig file it contains my logging credentials for storage and im nervous about adding it to a shared git repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a common scenario- you don't necessarily want to broadcast your remote storage credentials to everyone 
on your team, but you still want to check in your DVC setup (meaning, your <code>.dvc/config</code> file). In this case, you want to use a <code>local</code> config file!</p> <p>You can use the command</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc config</span> <span class="token parameter variable">--local</span></span></code></pre></div> <p>to set up remote credentials that will be stored in <code>.dvc/config.local</code>- by default, this file is in your <code>.gitignore</code> so you don't have to worry about accidentally committing secrets to your Git repository. <a href="https://dvc.org/doc/command-reference/config" target="_blank" rel="nofollow noopener noreferrer">Check out the docs</a> for more, including the <code>--global</code> and <code>--system</code> options for setting your configuration for multiple projects and users respectively.</p> <h2 id="cml-questions" style="position:relative;">CML Questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-whats-the-file-size-limit-for-publishing-files-with-cml-publish" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/751001285100306502" target="_blank" rel="nofollow noopener noreferrer">Q: What's the file size limit for publishing files with <code>cml publish</code>?</a><a 
href="#q-whats-the-file-size-limit-for-publishing-files-with-cml-publish" aria-label="q whats the file size limit for publishing files with cml publish permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><code>cml publish</code> is a service for hosting files that are embedded in CML reports, like images, audio files, and GIFs. By default, we have a limit of 2 MB per upload.</p> <p>If your files are larger than this (which can happen, depending on the machine learning problem you're working on!), we recommend using GitLab's artifact storage. <a href="https://github.com/iterative/cml/issues/232" target="_blank" rel="nofollow noopener noreferrer">Based on discussions in the community</a>, we recently implemented a CML flag (<code>--gitlab-uploads</code>) to streamline the process:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cml</span> publish movie.mov <span class="token parameter variable">--md</span> <span class="token parameter variable">--gitlab-uploads</span> <span class="token operator">></span> report.md</span></code></pre></div> <p>Note that we don't currently have an analogous solution for GitHub, because GitHub artifacts expire after 90 days (whereas they're permanent in GitLab).</p> <h3 id="q-im-getting-a-mysterious-error-message-failed-guessing-mime-type-of-file-when-i-try-to-use-cml-publish-whats-going-on" style="position:relative;"><a 
href="https://discordapp.com/channels/485586884165107732/728693131557732403/763840404675756042" target="_blank" rel="nofollow noopener noreferrer">Q: I'm getting a mysterious error message, <code>Failed guessing mime type of file</code>, when I try to use <code>cml publish</code>. What's going on?</a><a href="#q-im-getting-a-mysterious-error-message-failed-guessing-mime-type-of-file-when-i-try-to-use-cml-publish-whats-going-on" aria-label="q im getting a mysterious error message failed guessing mime type of file when i try to use cml publish whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This error message usually means that the target of <code>cml publish</code>- for example,</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cml</span> publish <span class="token operator"><</span>target file<span class="token operator">></span></span></code></pre></div> <p>is not found. Check for typos in the target filename and ensure that the file was in fact generated during the run (if it isn't part of your Git repository). 
We've <a href="https://github.com/iterative/cml/issues/308" target="_blank" rel="nofollow noopener noreferrer">opened an issue</a> to add a more informative error message in the future.</p> <h3 id="q-in-my-github-actions-workflow-i-use-dvc-metrics-diff-to-compare-metrics-generated-during-the-run-to-metrics-on-the-main-branch-and-print-a-table--but-the-table-isnt-showing-any-of-the-metrics-from-main-what-could-be-happening" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/768815157034876929" target="_blank" rel="nofollow noopener noreferrer">Q: In my GitHub Actions workflow, I use <code>dvc metrics diff</code> to compare metrics generated during the run to metrics on the main branch and print a table- but the table isn't showing any of the metrics from <code>main</code>. What could be happening?</a><a href="#q-in-my-github-actions-workflow-i-use-dvc-metrics-diff-to-compare-metrics-generated-during-the-run-to-metrics-on-the-main-branch-and-print-a-table--but-the-table-isnt-showing-any-of-the-metrics-from-main-what-could-be-happening" aria-label="q in my github actions workflow i use dvc metrics diff to compare metrics generated during the run to metrics on the main branch and print a table but the table isnt showing any of the metrics from main what could be happening permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>When a continuous integration runner won't report metrics from previous versions of your project (or other branches), that's usually a sign that the runner doesn't have access to the 
full Git history of your project or your metrics themselves. Here are a few things to check for:</p> <ol> <li><strong>Did you fetch your Git history in the runner?</strong> Commands like <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff</code></a> require the Git history to be accessible- make sure that in your workflow, before you run this command, you've done a <code>git fetch</code>. We recommend:</li> </ol> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git fetch</span> <span class="token parameter variable">--prune</span> <span class="token parameter variable">--unshallow</span></span></code></pre></div> <ol start="2"> <li> <p><strong>Are your metrics in your DVC remote?</strong> If your metrics are <em>cached</em> (which they are by default when you create a DVC pipeline), your DVC remote must be accessible to your runner. That means you need to add any credentials as repository secrets (or variables, in GitLab), and do <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> in your workflow before attempting <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff</code></a>.</p> </li> <li> <p><strong>Are your metrics in your local workspace?</strong> If you are <em>not</em> using a DVC remote, your metric files must be <em>uncached</em> and committed to your Git repository. 
To explore an example, say you have a pipeline stage that creates <code>metric.json</code>:</p> </li> </ol> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> mystage <span class="token parameter variable">-m</span> metric.json train.py</span></code></pre></div> <p>By default, <code>metric.json</code> is cached and ignored by Git- which means that if you aren't using a DVC remote in your CI workflow, <code>metric.json</code> will effectively be abandoned on your local machine! You can avoid this by using the <code>-M</code> flag instead of <code>-m</code> in <code>dvc run</code>, or manually adding the field <code>cache: false</code> to your metric in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. Be sure to remove your metrics from any <code>.gitignore</code> files, and commit and push them to your Git repository.</p> <p>That's all for this month- Happy Halloween! Watch out for scary bugs. 
🐛</p>https://dvc.org/blog/october-20-dvc-heartbeathttps://dvc.org/blog/october-20-dvc-heartbeatMon, 12 Oct 2020 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="paweł-gets-ready-to-speak-at-polands-largest-data-science-meeting" style="position:relative;">Paweł gets ready to speak at Poland's largest data science meeting<a href="#pawe%C5%82-gets-ready-to-speak-at-polands-largest-data-science-meeting" aria-label="paweł gets ready to speak at polands largest data science meeting permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC developer Paweł Redzyński (he's written a lot of the code behind <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots</code></a>) is giving a talk at the <a href="https://dssconf.pl/" target="_blank" rel="nofollow noopener noreferrer">Data Science Summit</a> in Poland! The virtual meeting is on October 16, but talks are available for streaming on demand up to a week before. 
Paweł's talk is part of the DataOps & Development track, where he'll be sharing about CML and GitHub Actions (note that it'll be delivered in English).</p> <p><a href="https://dssconf.pl" target="_blank" rel="nofollow noopener noreferrer"><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4af5a02cc92cd39e8cc7a546e1cbada8/39600/dss.png" alt="dss" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></a></p> <h3 id="dmitry-talks-at-data-engineering-melbourne" style="position:relative;">Dmitry talks at Data Engineering Melbourne<a href="#dmitry-talks-at-data-engineering-melbourne" aria-label="dmitry talks at data engineering melbourne permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>CEO <a href="https://www.meetup.com/Data-Engineering-Melbourne/events/267033998/" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov dropped into the Data Engineering Melbourne meetup</a> to talk about Data Versioning and DataOps! 
He spoke about the differences between end-to-end platforms and ecosystems of tools, and how this distinction informs the development of software like DVC and CML (hint: we picked tools over platforms).</p> <p>Keep an eye on this meetup, which is now accessible to folks on all continents thanks to the magic of the internet :)</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/Data-Engineering-Melbourne/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data Engineering Melbourne</h4> <div class="elp-description">Dmitry Petrov presents on DataOps and versioning.</div> <div class="elp-link">meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-10-12/Meetup_Logo-04501e404e41367b16280cd0515d54df.png" alt="Data Engineering Melbourne"> </div> </a> </section> <p></p> <h3 id="elle-has-talks-at-pycon-india-and-pydata-global" style="position:relative;">Elle has talks at PyCon India and PyData Global<a href="#elle-has-talks-at-pycon-india-and-pydata-global" aria-label="elle has talks at pycon india and pydata global permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Last week I gave a talk about CML at <a href="https://in.pycon.org/cfp/2020/proposals/how-to-make-continuous-integration-work-with-machine-learning~avK5b/" target="_blank" rel="nofollow noopener noreferrer">PyCon India</a>, and have another one coming up at <a href="https://global.pydata.org/talks/321" target="_blank" rel="nofollow noopener noreferrer">PyData 
Global</a> this November 11-15.</p> <p> </p><section class="elp-content-holder"> <a href="https://global.pydata.org/talks/321" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DevOps for science: using continuous integration for rigorous and reproducible analysis</h4> <div class="elp-description">PyData Global</div> <div class="elp-link">https://global.pydata.org</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-10-12/pydata-4857264047a11851293de84b3c988b3d.png" alt="DevOps for science: using continuous integration for rigorous and reproducible analysis"> </div> </a> </section> <p></p> <p>PyData Global has a fantastic lineup of talks spanning science and engineering, so please consider joining!</p> <h3 id="dvc-at-datafest" style="position:relative;">DVC at DataFest<a href="#dvc-at-datafest" aria-label="dvc at datafest permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC Ambassador Mikhail Rozhkov co-hosted the Machine Learning REPA (Reproducibility, Experiments and Pipelines Automation) track of <a href="https://datafest.ru/" target="_blank" rel="nofollow noopener noreferrer">DataFest 2020</a>, and DVC showed up in full force! 
There were talks from Dmitry, ambassador Marcel Ribeiro-Dantas, and myself about all aspects of MLOps and automation.</p> <p>DataFest is over (until next year, anyway), but <a href="http://ml-repa.ru/en#about" target="_blank" rel="nofollow noopener noreferrer">visit the ML-REPA community</a> for ongoing content and opportunities for networking.</p> <h3 id="new-videos" style="position:relative;">New videos<a href="#new-videos" aria-label="new videos permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Since the summer, we've been building our <a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube channel</a>. It's going great- we've gotten more than 18,000 views in the last few months and 1,500 subscribers!</p> <p>Our latest video in the <a href="https://www.youtube.com/playlist?list=PL7WG7YrwYcnDBDuCkFbcyjnZQrdskFsBz" target="_blank" rel="nofollow noopener noreferrer">MLOps Tutorials</a> series introduced using GitHub Actions for model testing- instead of training a model in continuous integration, the idea is to train locally and "check-in" your favorite model for testing in a standardized environment. 
This approach lets you completely control the environment, infrastructure, and code used to evaluate your model, and save the run in a place that's easy to share (GitHub!).</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/bSXUJRnQPPo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>We'll be going deeper into the art and craft of testing ML models in the next few weeks, so stay tuned. Another big initiative is adding videos to our docs: since video seems like a popular format for a lot of learners, we're working to supplement our official docs with embedded videos. 
Check out our first installment on the <a href="https://dvc.org/doc/start/data-and-model-versioning" target="_blank" rel="nofollow noopener noreferrer">Getting Started with Data Versioning</a>.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/kLKBcPonMYw?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our community makes some amazing tutorials. Here are a few on our radar:</p> <p>Data scientist and full-stack developer <a href="https://github.com/ashutosh1919" target="_blank" rel="nofollow noopener noreferrer">Ashutosh Hathidara</a> shared an end-to-end machine learning project made with DVC and CML… and released it in video form! 
It's a neat setup and a nice model for folks to study.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/H1VBsK7XiKs?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>Another detailed and easy-to-follow tutorial, with a similarly impressive scope, appeared on <a href="https://www.heise.de/" target="_blank" rel="nofollow noopener noreferrer">Heise Online</a>. This project puts together DVC, Cortex, and ONNX to develop and deploy a model trained on the Fashion MNIST dataset (note: the article is in German, and I read it with Chrome's English translation).</p> <p> </p><section class="elp-content-holder"> <a href="https://www.heise.de/hintergrund/Verwaltung-und-Inbetriebnahme-von-ML-Modellen-4911723.html" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Managing and commissioning ML models</h4> <div class="elp-description">Tools like DVC and Cortex, which are designed for the operationalization of AI projects, are intended to help developers deploy models in production.</div> <div class="elp-link">https://heise.de</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-10-12/heise-2f146c12022ee732eed47276fbd88d8d.png" alt="Managing and commissioning ML models"> </div> </a> </section> <p></p> <p>You'll also want to check out <a href="https://www.anno.ai/" target="_blank" rel="nofollow noopener noreferrer">anno.ai</a>'s tutorial about managing 
large datasets with DVC and S3 storage- it's detailed, but also a quick-start guide informed by the team's practical experience.</p> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/@anno.ai/mlops-and-data-managing-large-ml-datasets-with-dvc-and-s3-part-1-d5b8f2fb8280" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">MLOps and Data: Managing Large ML Datasets with DVC and S3 (Part 1)</h4> <div class="elp-description">A quick start guide to version control for machine learning data</div> <div class="elp-link">medium.com/@anno.ai</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-10-12/legos-b1ceab755de4875476325388196a546a.jpg" alt="MLOps and Data: Managing Large ML Datasets with DVC and S3 (Part 1)"> </div> </a> </section> <p></p> <p>Data scientist and mathematician <a href="https://twitter.com/KhuyenTran16" target="_blank" rel="nofollow noopener noreferrer">Khuyen Tran</a> blogged about why and how to start using DVC- and her tutorial includes Google Drive remote storage, a feature we're especially excited about. 
Check it out and follow along with her code examples!</p> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/introduction-to-dvc-data-version-control-tool-for-machine-learning-projects-7cb49c229fe0" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Introduction to DVC: Data Version Control Tool for Machine Learning Projects</h4> <div class="elp-description">Just like Git, but with Data!</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-10-12/khuyen_tran-f5684c74ede8821217b19dbcf295d7ec.jpg" alt="Introduction to DVC: Data Version Control Tool for Machine Learning Projects"> </div> </a> </section> <p></p> <p>And to end on a thoughtful note… have you seen this thread by ML Engineer <a href="https://twitter.com/sh_reya" target="_blank" rel="nofollow noopener noreferrer">Shreya Shankar</a>? She beautifully summarizes many of the ideas and technical challenges our community thinks about every day. Read and reflect!</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">In good software practices, you version code. Use Git. Track changes. Code in master is ground truth.<br><br>In ML, code alone isn't ground truth. I can run the same SQL query today and tomorrow and get different results. How do you replicate this good software practice for ML? 
(1/7)</p>— Shreya Shankar (@sh_reya) <a href="https://twitter.com/sh_reya/status/1314338372073263112">October 8, 2020</a></blockquote>https://dvc.org/blog/september-20-community-gemshttps://dvc.org/blog/september-20-community-gemsMon, 28 Sep 2020 00:00:00 GMT<h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-when-i-try-to-push-to-my-dvc-remote-i-get-an-error-about-my-ssh-rsa-keys-whats-going-on" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/748735263634620518" target="_blank" rel="nofollow noopener noreferrer">Q: When I try to push to my DVC remote, I get an error about my SSH-RSA keys. 
What's going on?</a><a href="#q-when-i-try-to-push-to-my-dvc-remote-i-get-an-error-about-my-ssh-rsa-keys-whats-going-on" aria-label="q when i try to push to my dvc remote i get an error about my ssh rsa keys whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you're using DVC with an SSH-protected remote, DVC uses a Python library called <code>paramiko</code> to create a connection to your remote. There is a <a href="https://stackoverflow.com/questions/51955990/base64-decoding-error-incorrect-padding-when-loading-putty-ppk-private-key-to" target="_blank" rel="nofollow noopener noreferrer">known issue</a> that <code>paramiko</code> expects RSA keys in OpenSSH key format, and can throw an error if the keys are in an alternative format (such as default PuTTY formatted keys). If this is the case, you'll likely see:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">ERROR: unexpected error - ('... ssh-rsa ...=', Error('Incorrect padding',))</code></pre></div> <p>To fix this, convert your RSA key to the OpenSSH format. 
Tools like <a href="https://www.puttygen.com/" target="_blank" rel="nofollow noopener noreferrer">PuTTYgen</a> and <a href="https://mobaxterm.mobatek.net/" target="_blank" rel="nofollow noopener noreferrer">MobaKeyGen</a> can help you do this.</p> <h3 id="q-can-i-have-multiple-paramyaml-files-in-a-project" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/753322309942509578" target="_blank" rel="nofollow noopener noreferrer">Q: Can I have multiple <code>param.yaml</code> files in a project?</a><a href="#q-can-i-have-multiple-paramyaml-files-in-a-project" aria-label="q can i have multiple paramyaml files in a project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, you can have as many separate parameter files as you'd like. It's only important that they are correctly specified in your DVC pipeline stages.</p> <p>For example, if you have files <code>params_data_processing.yaml</code> and <code>params_model.yaml</code> in your project (perhaps to store hyperparameters of your data processing and model fitting stages, respectively), you'll want to call the right file at each stage. 
For example:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> preprocess <span class="token punctuation">\</span> <span class="token parameter variable">-p</span> params_data_processing.yaml:param1,param2,<span class="token punctuation">..</span>.</span></code></pre></div> <h3 id="q-is-there-a-way-to-automatically-produce-svg-plots-from-dvc-plot-i-dont-like-having-to-click-through-the-vega-lite-gui-to-get-an-svg-and-my-plots-look-so-small-when-i-access-them-in-the-browser" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/750012082149392414" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to automatically produce SVG plots from <code>dvc plot</code>? I don't like having to click through the Vega-Lite GUI to get an SVG, and my plots look so small when I access them in the browser.</a><a href="#q-is-there-a-way-to-automatically-produce-svg-plots-from-dvc-plot-i-dont-like-having-to-click-through-the-vega-lite-gui-to-get-an-svg-and-my-plots-look-so-small-when-i-access-them-in-the-browser" aria-label="q is there a way to automatically produce svg plots from dvc plot i dont like having to click through the vega lite gui to get an svg and my plots look so small when i access them in the browser permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If your DVC plots (and by DVC plots, we 
mean Vega-Lite plots 😉) look small in your browser, you can modify this programmatically! DVC generates Vega-Lite plots by way of a few templates that come pre-loaded. The templates are in <code>.dvc/plots</code> (assuming you're in a DVC directory).</p> <p>Find the template that corresponds to your plot (if you didn't specify a plot type in your CLI command, it's probably <code>default.json</code>) and modify the <code>height</code> and <code>width</code> parameters. Then save your changes.</p> <p>For more about how to modify your plot templates, check out the <a href="https://vega.github.io/vega/docs/specification/" target="_blank" rel="nofollow noopener noreferrer">Vega docs</a>. If you're considering making a whole new template that's custom for your data viz needs, <a href="https://dvc.org/doc/command-reference/plots#custom-templates" target="_blank" rel="nofollow noopener noreferrer">we've got docs on that</a>, too.</p> <p>One last tip: did you know about the <a href="https://anaconda.org/conda-forge/vega-lite-cli" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite CLI</a>? It provides commands for converting Vega-Lite plots to <code>.pdf</code>, <code>.png</code>, <code>.svg</code>, and <code>.vg</code> (Vega) formats. 
To use this approach with DVC, you'll want to use the <code>--show-vega</code> flag to print your plot specification to a <code>.json</code> file.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots</span> <span class="token parameter variable">--show-vega</span> <span class="token operator">></span> vega.json </span><span class="token line"><span class="token input">$ </span><span class="token command">vl2svg</span> vega.json</span></code></pre></div> <h3 id="q-im-confused-about-external-dependencies-and-outputs-whats-the-difference" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/752478399326453840" target="_blank" rel="nofollow noopener noreferrer">Q: I'm confused about external dependencies and outputs. What's the difference?</a><a href="#q-im-confused-about-external-dependencies-and-outputs-whats-the-difference" aria-label="q im confused about external dependencies and outputs whats the difference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In short, external outputs and dependencies are files or directories that are tracked by DVC, but physically reside outside of the local workspace. 
This could happen for a few reasons:</p> <ul> <li>You want to version a dataset in cloud storage that is too large to transfer to your local workspace efficiently</li> <li>Your DVC pipeline writes directly to cloud storage</li> <li>Your DVC pipeline depends on a dataset or other file in cloud storage</li> </ul> <p>An <strong>external output</strong> can be declared in one of two ways. For example, if you have a file <code>data.csv</code> in S3 storage, you can use <a href="https://dvc.org/doc/command-reference/add#--external"><code>dvc add --external s3://mybucket/data.csv</code></a> to start tracking the file with DVC (<a href="https://dvc.org/doc/user-guide/managing-external-data" target="_blank" rel="nofollow noopener noreferrer">there are plenty more details and tips about managing external data in our docs</a>). You can also declare <code>data.csv</code> as an output of a DVC pipeline with <code>dvc run -o s3://mybucket/data.csv</code>.</p> <p>An <strong>external dependency</strong> is a dependency of a DVC pipeline that resides in cloud storage. It's declared with the syntax <code>dvc run -d s3://mybucket/data.csv</code>.</p> <p>One other difference to note: DVC doesn't cache external dependencies; it merely checks if they have changed when you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>. On the other hand, DVC <em>does</em> cache external outputs. You'll want to set up an <a href="https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache#configure-the-shared-cache" target="_blank" rel="nofollow noopener noreferrer">external cache</a> in the same remote location where your files are stored. This is because the default cache location (in your local workspace) no longer makes sense when the dataset never "visits" your local workspace! 
An external cache works largely the same as a typical cache in your workspace.</p> <h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-how-can-i-use-cml-with-my-own-docker-container" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/757553135840526376" target="_blank" rel="nofollow noopener noreferrer">Q: How can I use CML with my own Docker container?</a><a href="#q-how-can-i-use-cml-with-my-own-docker-container" aria-label="q how can i use cml with my own docker container permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In many of our CML docs and videos, we've shown how to get CML on your CI (continuous integration) runner via a Docker container that comes with everything installed. 
But this is not the only way to use CML, especially if you want workflows to run in your own Docker container.</p> <p>You can install CML via <code>npm</code>, either in your own Docker container or in your CI workflow (i.e., in your GitHub Actions <code>.yaml</code> or GitLab CI <code>.yml</code> workflow file).</p> <p>To install CML as a package, you'll want to run:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">npm</span> i <span class="token parameter variable">-g</span> @dvcorg/cml</code></pre></div> <p>Note that you may need to install additional dependencies if you want to use DVC plots and Vega-Lite commands:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> <span class="token function">apt-get</span> <span class="token function">install</span> <span class="token parameter variable">-y</span> libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev <span class="token punctuation">\</span> librsvg2-dev libfontconfig-dev $ <span class="token function">npm</span> <span class="token function">install</span> <span class="token parameter variable">-g</span> vega-cli vega-lite</code></pre></div> <p>If you're installing CML as part of your workflow, you may need to install Node first- <a href="https://github.com/iterative/cml#install-cml-as-a-package" target="_blank" rel="nofollow noopener noreferrer">check out our docs</a> for how to do this in GitHub Actions and GitLab CI.</p> <h3 id="q-after-running-a-github-action-workflow-that-runs-a-dvc-pipeline-i-want-to-save-the-output-of-the-pipeline-why-doesnt-cml-automatically-save-the-output" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/757686601953312988" target="_blank" rel="nofollow noopener noreferrer">Q: After running a GitHub Action workflow that runs a DVC 
pipeline, I want to save the output of the pipeline. Why doesn't CML automatically save the output?</a><a href="#q-after-running-a-github-action-workflow-that-runs-a-dvc-pipeline-i-want-to-save-the-output-of-the-pipeline-why-doesnt-cml-automatically-save-the-output" aria-label="q after running a github action workflow that runs a dvc pipeline i want to save the output of the pipeline why doesnt cml automatically save the output permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>By design, artifacts generated in a CI workflow aren't saved anywhere- they disappear as soon as the runner shuts down. So a DVC pipeline executed in your CI system might produce outputs, like transformed datasets and model files, that will be lost at the end of the run. If you want to save them, there are a few methods.</p> <p>One approach is with auto-commits: a <code>git commit</code> at the end of your CI workflow to commit any new artifacts to your Git repository. However, auto-commits have a lot of downsides- they don't make sense for a lot of users, and generally, it's better to re-create outputs as needed than save them forever in your Git repo.</p> <p>We created the DVC <code>run-cache</code> in part <a href="https://stackoverflow.com/questions/61245284/is-it-necessary-to-commit-dvc-files-from-our-ci-pipelines" target="_blank" rel="nofollow noopener noreferrer">to solve this issue</a>. 
Here's how it works: you'll set up a DVC remote with access credentials passed to your GitHub Action/GitLab CI via CML (see, for example, <a href="https://github.com/iterative/cml_dvc_case/blob/master/.github/workflows/cml.yaml" target="_blank" rel="nofollow noopener noreferrer">this workflow</a>). Then you'll use the following protocol in your CI workflow (your workflow config file in GitHub/GitLab):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> <span class="token parameter variable">--run-cache</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span> <span class="token parameter variable">--run-cache</span></span></code></pre></div> <p>When you use this design, any artifacts of <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, such as models or transformed datasets, will be saved in DVC storage and indexed by the pipeline version that generated them. 
You can access them in your local workspace by running:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> <span class="token parameter variable">--run-cache</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span></span></code></pre></div> <p>While we think this is ideal for typical data science and machine learning workflows, there are other approaches too- if you want to dig deeper into auto-commits, check out the <a href="https://github.com/marketplace/actions/add-commit" target="_blank" rel="nofollow noopener noreferrer">Add & Commit GitHub Action</a>.</p> <h3 id="q-what-can-cml-do-that-circle-ci-cant-do" style="position:relative;"><a href="https://www.youtube.com/watch?v=9BgIDqAzfuA&lc=Ugylt6QR5ClmD8uHe4B4AaABAg" target="_blank" rel="nofollow noopener noreferrer">Q: What can CML do that Circle CI can't do?</a><a href="#q-what-can-cml-do-that-circle-ci-cant-do" aria-label="q what can cml do that circle ci cant do permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To be clear, CML isn't a competitor to Circle CI. 
Circle CI is more analogous to GitHub Actions or GitLab CI; it's a continuous integration system.</p> <p>CML is a toolkit that works with a continuous integration system to 1) provide big data management (via DVC & cloud storage), 2) help you write model metrics and data viz to comments in GitHub/Lab, and 3) orchestrate cloud resources for model training and testing. Currently, CML is only available for GitHub Actions and GitLab CI.</p> <p>So to sum it up: CML is not a standalone continuous integration system! It's a toolkit that works with existing systems, which in the future could include Circle CI, Jenkins, Bamboo, Azure DevOps Pipelines, and Travis CI. Feel free to <a href="https://github.com/iterative/cml/issues" target="_blank" rel="nofollow noopener noreferrer">open a feature request ticket</a>, or leave a 👍 on open requests, to "vote" for the integrations you'd like to see most.</p>https://dvc.org/blog/september-20-dvc-heartbeathttps://dvc.org/blog/september-20-dvc-heartbeatWed, 09 Sep 2020 00:00:00 GMT<h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="dmitry-on-software-engineering-daily" style="position:relative;">Dmitry on Software Engineering Daily<a href="#dmitry-on-software-engineering-daily" aria-label="dmitry on software engineering daily permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our CEO Dmitry Petrov was interviewed on the much-beloved Software Engineering Daily podcast! Host <a href="https://twitter.com/the_prion" target="_blank" rel="nofollow noopener noreferrer">Jeff Meyerson</a> kicked off the discussion:</p> <blockquote> <p>Code is version controlled through Git, the version control system originally built to manage the Linux codebase. For decades, software has been developed using git for version control. More recently, data engineering has become an unavoidable facet of software development. It is reasonable to ask–why are we not version controlling our data?</p> </blockquote> <p>For the rest of the episode, listen here!</p> <p> </p><section class="elp-content-holder"> <a href="https://softwareengineeringdaily.com/2020/08/24/data-version-control-with-dmitry-petrov/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data Version Control with Dmitry Petrov</h4> <div class="elp-description"></div> <div class="elp-link">softwareengineeringdaily.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-09-09/sedaily-3eb9a64f46034c9319af2b611a55202e.jpeg" alt="Data Version Control with Dmitry Petrov"> </div> </a> </section> <p></p> <h3 id="contributors-meetup" style="position:relative;">Contributor's meetup<a href="#contributors-meetup" aria-label="contributors meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 
0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Last week, we held a meetup for contributors to DVC! Core maintainer <a href="https://github.com/efiop" target="_blank" rel="nofollow noopener noreferrer">Ruslan Kuprieiev</a> hosted a get-together for folks who contribute new features, bug fixes, and more to the community. If you missed it, you can watch it on YouTube.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/jUYSTERXxWg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="new-videos" style="position:relative;">New videos<a href="#new-videos" aria-label="new videos permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We've released several new videos to our growing <a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">YouTube channel</a>- and cool news, we passed 1,000 subscribers! The support has been surprising in the best way possible. 
We're seeing a lot of repeat commenters and folks from the DVC meetups! It's been so rewarding to get positive feedback from the community and we're planning to build our YouTube presence even more.</p> <p><img src="https://media.giphy.com/media/ZE0JppdERv8t4jVCAt/giphy.gif" alt="Happy GIF"></p> <p><em>Even Skeletor finds joy in this.</em></p> <p>We now have 4 tutorials in our MLOps series. In the latest, we cover how to use your own GPU (on-premise or in the cloud) to run GitHub Actions workflows. Check it out and give it a try, the code examples are freely available :)</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/rVq-SCNyxVc?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>We also made our first ever "explainer" video to talk through how DVC works in five minutes.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/UbL7VUpv1Bs?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>As always, video requests 
are welcome! Reach out and let us know what topics and tutorials you want to see covered. And we appreciate any likes, shares, and subscribes on our growing YouTube channel.</p> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="a-three-part-cml-series-featuring-r" style="position:relative;">A three-part CML series (featuring R!)<a href="#a-three-part-cml-series-featuring-r" aria-label="a three part cml series featuring r permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC ambassador <a href="https://twitter.com/mribeirodantas" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a> has published two of three tutorial blogs in a series on CML! 
Marcel's use case is especially cool because he's using R, plus some causal modeling related to his work in bioinformatics, with GitHub Actions.</p> <p>In Part I, Marcel introduces his project and how he uses DVC, CML and GitHub Actions together (with his custom R library).</p> <p> </p><section class="elp-content-holder"> <a href="https://mribeirodantas.xyz/blog/index.php/2020/08/10/continuous-machine-learning/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Continuous Machine Learning - Part I</h4> <div class="elp-description">by Marcel Ribeiro-Dantas</div> <div class="elp-link">mribeirodantas.xyz</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-09-09/MLOps-8126305fe5b093898314fd6250f4b95c.png" alt="Continuous Machine Learning - Part I"> </div> </a> </section> <p></p> <p>In Part II, Marcel takes a deeper dive into Docker. He explains how to create your own Docker image and test it. 
This case should be helpful for folks who want to include the CML library in their own Docker container.</p> <p> </p><section class="elp-content-holder"> <a href="https://mribeirodantas.xyz/blog/index.php/2020/08/18/continuous-machine-learning-part-ii/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Continuous Machine Learning - Part II</h4> <div class="elp-description">by Marcel Ribeiro-Dantas</div> <div class="elp-link">mribeirodantas.xyz</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-09-09/docker_logo-08a71e88bc63e58e0b64be1a87f46d19.png" alt="Continuous Machine Learning - Part II"> </div> </a> </section> <p></p> <h3 id="real-python-talks-dvc" style="position:relative;">Real Python talks DVC<a href="#real-python-talks-dvc" aria-label="real python talks dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://twitter.com/kristijan_ivanc" target="_blank" rel="nofollow noopener noreferrer">Kristijan Ivancic</a> of <a href="https://realpython.com">Real Python</a>, a library of online Python tutorials and lessons, created a <em>seriously</em> impressive DVC tutorial (this thing is a beast 🐺- it has a table of contents!)</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0223cef84ac40de89a1dee595a176c51/39600/Real_Python.png" alt="Real Python" title="" loading="auto" 
decoding="async" style="max-width: 100%; margin: auto;"></span><em>How cool is this artwork?</em></p> <p>And, the Real Python podcast discussed their DVC tutorial (plus the joys of version control for data!) on a recent episode.</p> <p> </p><section class="elp-content-holder"> <a href="https://realpython.com/podcasts/rpp/25/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Episode 25: Data Version Control in Python and Real Python Video Transcripts</h4> <div class="elp-description">The Real Python Podcast</div> <div class="elp-link">realpython.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-09-09/podcast_log-3fcbb7ce3ba571bd42ef1742701fcda4.png" alt="Episode 25: Data Version Control in Python and Real Python Video Transcripts"> </div> </a> </section> <p></p> <h3 id="recommended-reading" style="position:relative;">Recommended reading<a href="#recommended-reading" aria-label="recommended reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There's a lot of cool stuff happening out there in the data science world 🌏!</p> <ul> <li><a href="https://twitter.com/fab_clemente" target="_blank" rel="nofollow noopener noreferrer">Fabiana Clemente</a>, Chief Data Officer of <a href="https://ydata.ai/" target="_blank" rel="nofollow noopener noreferrer">YData</a>, published a blog for The Startup about four reasons to start using data version control- and, with her expertise in data privacy, she's especially well-qualified to explain the role of DVC 
in compliance and auditing! Check out her blog (it comes with a quick-start tutorial, too).</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/swlh/4-reasons-why-data-scientists-should-version-data-672aca5bbd0b" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">4 reasons why data scientists should version data</h4> <div class="elp-description">How to start data versioning using DVC</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-09-09/fabiana-1a47c481b8ffec6781d7125892845083.jpg" alt="4 reasons why data scientists should version data"> </div> </a> </section> <p></p> <ul> <li>Ryzal Kamis at the <a href="https://makerspace.aisingapore.org">AI Singapore Makerspace</a> shared a blog (the first of two!) about creating end-to-end CI/CD workflows for machine learning. In his first blog, Ryzal gives a high-level overview of the need for data version control and compares several tools in the space. Then he gives a walkthrough (quite easy to follow!) of how DVC fits in his workflow. 
We're eagerly awaiting the second installment of this series, which promises to bring more advanced automation scenarios and a CI/CD pipeline.</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://makerspace.aisingapore.org/2020/08/data-versioning-for-cd4ml-part-1/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data Versioning for CD4ML</h4> <div class="elp-description">Part 1</div> <div class="elp-link">makerspace.aisingapore.org</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-09-09/singapore-4ff5f8d09325f533e0f48806b348eb95.jpg" alt="Data Versioning for CD4ML"> </div> </a> </section> <p></p> <ul> <li><a href="https://www.infoworld.com/author/Isaac-Sacolick/" target="_blank" rel="nofollow noopener noreferrer">Isaac Sacolick</a>, contributing editor at InfoWorld, penned an article about the growing field of MLOps and its role in data-driven businesses. He writes:</li> </ul> <blockquote> <p>Too many data and technology implementations start with poor or no problem statements and with inadequate time, tools, and subject matter expertise to ensure adequate data quality. Organizations must first start with asking smart questions about big data, investing in dataops, and then using agile methodologies in data science to iterate toward solutions.</p> </blockquote> <p>Read the rest here:</p> <p> </p><section class="elp-content-holder"> <a href="https://www.infoworld.com/article/3570716/mlops-the-rise-of-machine-learning-operations.html" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">MLops: The rise of machine learning operations</h4> <div class="elp-description">Once machine learning models make it to production, they still need updates and monitoring for drift. 
A team to manage ML operations makes good business sense</div> <div class="elp-link">infoworld.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-09-09/infoworld-f3a590c3134bfbf003256a42ae130b55.png" alt="MLops: The rise of machine learning operations"> </div> </a> </section> <p></p> <p>Thanks everyone, that's a wrap for this month. Be safe, stay in touch, and get ready for pumpkin spice latte season 🎃.</p> <p><img src="https://media.giphy.com/media/EDpVRPFK5bjfq/giphy.gif" alt="Cat Fall GIF"></p>https://dvc.org/blog/august-20-community-gemshttps://dvc.org/blog/august-20-community-gemsThu, 27 Aug 2020 00:00:00 GMT<p>Here are some of our top Q&A's from around the community. With the launch of <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> earlier in the month, we've got some new ground to cover!</p> <h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-whats-the-relationship-between-the-dvc-remote-and-cache-if-i-have-an-external-cache-do-i-really-need-a-dvc-remote" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/747588572479094866" target="_blank" rel="nofollow noopener noreferrer">Q: What's the relationship between the DVC remote and cache? 
If I have an external cache, do I really need a DVC remote?</a><a href="#q-whats-the-relationship-between-the-dvc-remote-and-cache-if-i-have-an-external-cache-do-i-really-need-a-dvc-remote" aria-label="q whats the relationship between the dvc remote and cache if i have an external cache do i really need a dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can think of your DVC remote as similar to your Git remote, but for data and model artifacts- it's a place to back up and share artifacts. It also gives you methods to push and pull those artifacts to and from your team.</p> <p>Your DVC cache (by default, it's located in <code>.dvc/cache</code>) serves a similar purpose to your Git objects database (which is by default located in <code>.git/objects</code>). They're both <em>local</em> caches that store files (including various versions of them) in a content-addressable format, which helps you quickly check out different versions to your local workspace. 
The difference is that <code>.dvc/cache</code> is for data/model artifacts, and <code>.git/objects</code> is for code.</p> <p>Usually, your DVC remote is a superset of <code>.dvc/cache</code>- everything in your cache is a copy of something in your remote. The two can still differ: a file in your remote won't be in your cache until you <code>pull</code> it, and a file in your cache won't be in your remote until you <code>push</code> it.</p> <p>In theory, if you are using an <a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">external cache</a>- meaning a DVC cache configured on a separate volume (like NAS, large HDD, etc.) outside your project path- and all your projects and all your teammates use that external cache, and you <em>know</em> that the storage is highly reliable, you don't need to also have a DVC remote. If you have any doubts about access to your external cache or its reliability, we'd recommend also keeping a remote.</p> <h3 id="q-one-of-my-files-is-an-output-of-a-dvc-pipeline-and-i-want-to-track-this-file-with-git-and-store-it-in-my-git-repository-since-it-isnt-very-big-how-can-i-make-this-work" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/732308317627613235" target="_blank" rel="nofollow noopener noreferrer">Q: One of my files is an output of a DVC pipeline, and I want to track this file with Git and store it in my Git repository since it isn't very big. 
How can I make this work?</a><a href="#q-one-of-my-files-is-an-output-of-a-dvc-pipeline-and-i-want-to-track-this-file-with-git-and-store-it-in-my-git-repository-since-it-isnt-very-big-how-can-i-make-this-work" aria-label="q one of my files is an output of a dvc pipeline and i want to track this file with git and store it in my git repository since it isnt very big how can i make this work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes! There are two approaches. We'll be assuming you have a pipeline stage that outputs a file, <code>myfile</code>.</p> <ul> <li>If you haven't declared the pipeline stage with <code>dvc run</code> yet, then you'll do it like this:</li> </ul> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> <span class="token operator"><</span>stage name<span class="token operator">></span> <span class="token parameter variable">-d</span> <span class="token operator"><</span>dependency<span class="token operator">></span> <span class="token parameter variable">-O</span> myfile</span></code></pre></div> <p>Note that instead of using the flag <code>-o</code> for specifying the output <code>myfile</code>, we're using <code>-O</code>- it's shorthand for <code>--outs-no-cache</code>. 
You can <a href="https://dvc.org/doc/command-reference/run#options" target="_blank" rel="nofollow noopener noreferrer">read about this flag in our docs</a>.</p> <ul> <li>If you've already created your pipeline stage, go into your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> and manually add the field <code>cache: false</code> to the stage's output as follows:</li> </ul> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">outs</span><span class="token punctuation">:</span>
  <span class="token punctuation">-</span> <span class="token key atrule">myfile</span><span class="token punctuation">:</span>
      <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span></code></pre></div> <p>Please note one special case: if you previously enabled hardlinks or symlinks in DVC via <a href="https://dvc.org/doc/command-reference/config"><code>dvc config cache</code></a>, you may need to run <a href="https://dvc.org/doc/command-reference/unprotect"><code>dvc unprotect myfile</code></a> to fully unlink <code>myfile</code> from your DVC cache. If you haven't enabled these types of file links (and if you're not sure, <em>you probably didn't!</em>), this step is unnecessary. 
<a href="https://dvc.org/doc/command-reference/unprotect" target="_blank" rel="nofollow noopener noreferrer">See our docs for more.</a></p> <h3 id="q-can-i-change-my-paramsyaml-file-to-a-json" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/730614265051873370" target="_blank" rel="nofollow noopener noreferrer">Q: Can I change my <code>params.yaml</code> file to a <code>.json</code>?</a><a href="#q-can-i-change-my-paramsyaml-file-to-a-json" aria-label="q can i change my paramsyaml file to a json permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, this is straightforward- you change your <code>params.yaml</code> to <code>params.json</code> in your workspace, and then use it in <code>dvc run</code>:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-p</span> params.json:myparam <span class="token punctuation">..</span>.</span></code></pre></div> <p>Alternately, if your pipeline stage has already been created, you can manually edit your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file to replace <code>params.yaml</code> with <code>params.json</code>.</p> <p>For more about the <code>params.yaml</code> file, <a href="https://dvc.org/doc/start/experiments#defining-parameters" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>.</p> <h3 
id="q-is-there-a-guide-for-migrating-from-git-lfs-to-dvc" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/743559246599421974" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a guide for migrating from Git-LFS to DVC?</a><a href="#q-is-there-a-guide-for-migrating-from-git-lfs-to-dvc" aria-label="q is there a guide for migrating from git lfs to dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We don't know of any published guide. One of our users shared their procedure for disabling LFS:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> lfs uninstall </span><span class="token line"><span class="token input">$ </span><span class="token command">git</span> <span class="token function">rm</span> .gitattributes </span><span class="token line"><span class="token input">$ </span><span class="token command">git</span> <span class="token function">rm</span> .lfsconfig</span></code></pre></div> <p>Then you can <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> files you wish to put in DVC tracking, and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> them to your remote. 
After that, <code>git commit</code> and you're good!</p> <p>Note that, if you're going to delete any LFS files, make sure you're certain the corresponding data has been transferred to DVC.</p> <h3 id="q-is-there-a-way-to-use-dvc-and-cml-to-validate-a-model-in-a-github-action-without-making-the-validation-data-available-to-the-user-opening-the-pull-request" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/739202123295883325" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to use DVC and CML to validate a model in a GitHub Action, without making the validation data available to the user opening the Pull Request?</a><a href="#q-is-there-a-way-to-use-dvc-and-cml-to-validate-a-model-in-a-github-action-without-making-the-validation-data-available-to-the-user-opening-the-pull-request" aria-label="q is there a way to use dvc and cml to validate a model in a github action without making the validation data available to the user opening the pull request permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We don't have special support for this use case, and there may be some security downsides to using a confidential validation dataset with someone else's code (be sure nothing in their code could expose your data!). 
But there are ways to implement this if you're sure about it.</p> <p>One possible approach is to create a separate "data registry" repository using a private cloud bucket to store your validation dataset (<a href="https://dvc.org/doc/use-cases/data-registries#data-registries" target="_blank" rel="nofollow noopener noreferrer">see our docs about the why and how of data registries</a>). Your CI system can be set up to have access to the data registry via secrets (called "variables" in GitLab). Then when you run validation via <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro validate</code></a>, you could use <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> to pull the private data from the registry.</p> <p>The data is never exposed to the user in an interactive setting, only on the runner- and there it's ephemeral, meaning it does not exist once the runner shuts down.</p> <h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-sometimes-when-i-make-a-commit-on-a-branch-my-ci-workflow-isnt-triggered-whats-going-on" style="position:relative;"><a href="https://www.youtube.com/watch?v=9BgIDqAzfuA&lc=UgwKIYsCo194AErdeBJ4AaABAg" target="_blank" rel="nofollow noopener noreferrer">Q: Sometimes when I make a commit on a branch, my CI workflow isn't triggered. 
What's going on?</a><a href="#q-sometimes-when-i-make-a-commit-on-a-branch-my-ci-workflow-isnt-triggered-whats-going-on" aria-label="q sometimes when i make a commit on a branch my ci workflow isnt triggered whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If your workflow is set to trigger on a push (as in the CML use cases), it isn't enough to <code>git commit</code> locally- you need to push to your GitHub or GitLab repository. If you want every commit to trigger your workflow, you'll need to push each one!</p> <p>What if you <em>don't</em> want a push to trigger your workflow? In GitLab, you can use the <a href="https://docs.gitlab.com/ee/ci/yaml/#skip-pipeline" target="_blank" rel="nofollow noopener noreferrer"><code>[ci skip]</code> flag</a>- make sure your commit message contains <code>[ci skip]</code> or <code>[skip ci]</code>, and GitLab CI won't run the pipeline in your <code>.gitlab-ci.yml</code> file.</p> <p>In GitHub Actions, this flag isn't supported, so you can manually kill any workflows in the Actions dashboard. 
For a programmatic fix, <a href="https://timheuer.com/blog/skipping-ci-github-actions-workflows/" target="_blank" rel="nofollow noopener noreferrer">check out this workaround by Tim Heuer</a>.</p> <h3 id="q-can-i-do-the-bulk-of-my-model-training-outside-of-my-ci-system-and-then-share-the-result-with-cml" style="position:relative;"><a href="https://twitter.com/peterkuai/status/1295899690404175872" target="_blank" rel="nofollow noopener noreferrer">Q: Can I do the bulk of my model training outside of my CI system, and then share the result with CML?</a><a href="#q-can-i-do-the-bulk-of-my-model-training-outside-of-my-ci-system-and-then-share-the-result-with-cml" aria-label="q can i do the bulk of my model training outside of my ci system and then share the result with cml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Definitely! This is a desirable workflow in several cases:</p> <ul> <li>You have a preferred approach for experiment tracking (for example, DVC or MLflow) that you want to keep using</li> <li>You don't want to set up a self-hosted runner to connect your computing resources to GitHub or GitLab</li> <li>Training time is on the order of days or more</li> </ul> <p>CML is very flexible, and one strong use case is for sanity checking and evaluating a model in a CI system post-training. When you have a model that you're satisfied with, you can check it into your CI system and use CML to evaluate the model in a production-like environment (such as a custom Docker container), and report its behavior and informative metrics. 
Then you can decide if it's ready to be merged into your main branch.</p> <h3 id="q-can-i-make-a-cml-report-comparing-models-across-different-branches-of-a-project" style="position:relative;"><a href="https://github.com/iterative/cml/issues/188" target="_blank" rel="nofollow noopener noreferrer">Q: Can I make a CML report comparing models across different branches of a project?</a><a href="#q-can-i-make-a-cml-report-comparing-models-across-different-branches-of-a-project" aria-label="q can i make a cml report comparing models across different branches of a project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Definitely. This is what <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff</code></a> is for: it's like a <code>git diff</code>, but for model metrics instead of code. 
We made a video about how to do this in CML!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/xPncjKH6SPk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="q-in-the-function-cml-publish-it-looks-like-youre-uploading-published-files-to-httpsassetcmldev-why-dont-you-just-save-images-in-the-git-repository" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/745168931521822740" target="_blank" rel="nofollow noopener noreferrer">Q: In the function <code>cml publish</code>, it looks like you're uploading published files to <code>https://asset.cml.dev</code>. 
Why don't you just save images in the Git repository?</a><a href="#q-in-the-function-cml-publish-it-looks-like-youre-uploading-published-files-to-httpsassetcmldev-why-dont-you-just-save-images-in-the-git-repository" aria-label="q in the function cml publish it looks like youre uploading published files to httpsassetcmldev why dont you just save images in the git repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If an image file is created as part of your workflow, it's ephemeral: it doesn't exist outside of your CI runner, and will disappear when your runner is shut down. To include an image in a GitHub or GitLab comment, a link to the image needs to persist. You could commit the image to your repository, but typically, <a href="https://stackoverflow.com/questions/61245284/is-it-necessary-to-commit-dvc-files-from-our-ci-pipelines" target="_blank" rel="nofollow noopener noreferrer">it's undesirable to automatically commit results of a CI workflow</a>.</p> <p>We created a publishing service to help you host files for CML reports. Under the hood, our service uploads your file to an S3 bucket and uses a key-value store to share the file with you.</p> <p>This covers a lot of cases, but if the files you wish to publish can't be shared with our service for security or privacy reasons, you can emulate the <code>cml publish</code> function with your own storage. 
You would push your file to storage and include a link to its address in your markdown report.</p>https://dvc.org/blog/august-20-dvc-heartbeathttps://dvc.org/blog/august-20-dvc-heartbeatMon, 10 Aug 2020 00:00:00 GMT<p>Welcome to our August roundup of cool news, new releases, and recommended reading in the MLOps world!</p> <h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="cml-release" style="position:relative;">CML release<a href="#cml-release" aria-label="cml release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>At the beginning of July, we went live with a new project: <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">Continuous Machine Learning, or CML</a> for short. If you haven't heard, CML is an open-source toolkit for adapting popular continuous integration systems like GitHub Actions and GitLab CI for machine learning and data science. This release marks a new stage for our organization: while CML can work with DVC, and both are built around Git, CML is designed for standalone use. 
That means we're supporting TWO projects now!</p> <p><img src="https://media.giphy.com/media/X5i2BoQeD9kWY/giphy.gif" alt="Threaten Ashley Olsen GIF"></p> <p>Luckily, we received plenty of encouraging and helpful feedback following the CML release. CML was on the front page of Hacker News for most of release day! We also got <a href="https://www.heise.de/news/Machine-Learning-CML-schickt-Daten-und-Modelltraining-in-die-Pipeline-4841023.html" target="_blank" rel="nofollow noopener noreferrer">covered on Heise</a>, a popular German IT news source. I (Elle, a proud part of the CML team!) also gave a talk presenting our approach as part of the MLOps World meeting, which is now available for online viewing.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/yp0su5mOeko?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>Of course, we're fielding lots of questions too! We've compiled some of the most common questions (and their answers!) in our last <a href="https://dvc.org/blog/july-20-community-gems" target="_blank" rel="nofollow noopener noreferrer">Community Gems post</a>, and CML developer <a href="https://github.com/DavidGOrtega" target="_blank" rel="nofollow noopener noreferrer">David G. 
Ortega</a> has written a tutorial for a much-asked-for use case: doing <a href="https://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpus" target="_blank" rel="nofollow noopener noreferrer">continuous integration with on-demand GPUs</a>.</p> <p>If you have comments, questions, or feature requests about CML, we <em>really</em> want to hear from you. A few ways to be in touch:</p> <ul> <li>Open an <a href="https://github.com/iterative/cml/issues" target="_blank" rel="nofollow noopener noreferrer">issue on the project repo</a></li> <li>Drop by the <a href="https://discord.gg/bzA6uY7" target="_blank" rel="nofollow noopener noreferrer">CML Discord channel</a></li> <li>Send us <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">an email</a></li> </ul> <h3 id="july-meetup" style="position:relative;">July Meetup<a href="#july-meetup" aria-label="july meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Last week, we had another meetup! <a href="http://mribeirodantas.me/" target="_blank" rel="nofollow noopener noreferrer">DVC Ambassador Marcel</a> kicked us off with a short talk about how he's using DVC as part of his causal modeling approach to bioinformatics. It's cool stuff. Then, I talked a bit about CML and did some live-coding. 
The beauty of live-coding is getting to answer questions in real time, and if you're totally new to the idea of continuous integration (or want to understand how CML works with GitHub Actions/GitLab CI), seeing a project in action is one of the best ways to learn.</p> <p>You can watch a recording of the meetup online now (it's lightly edited to remove some pesky Zoom trolls), and <a href="https://www.meetup.com/DVC-Community-Virtual-Meetups" target="_blank" rel="nofollow noopener noreferrer">join our Meetup group</a> to get updates for the next one. In future meetups, we'd love to support community members sharing their work, so get in touch if you'd like to present.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/tnTPHG5seDs?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h3 id="new-video-series" style="position:relative;">New video series<a href="#new-video-series" aria-label="new video series permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We're starting up some new YouTube features! 
If you haven't seen our channel, <a href="https://www.youtube.com/channel/UC37rp97Go-xIX3aNFVHhXfQ" target="_blank" rel="nofollow noopener noreferrer">check it out and consider subscribing</a> for hands-on tutorials and demos. Our <a href="https://youtu.be/9BgIDqAzfuA" target="_blank" rel="nofollow noopener noreferrer">first video introduced continuous integration and GitHub Actions</a>, and the second showed <a href="https://youtu.be/kZKAuShWF0s" target="_blank" rel="nofollow noopener noreferrer">how to use DVC and free Google Drive storage to add external data storage to a GitHub project</a>.</p> <p>In the coming weeks, we'll be covering:</p> <ul> <li>Using CML and GitHub Actions with hardware for deep learning, like on-premise GPUs</li> <li>Understanding Vega plots and making data viz part of your CI system</li> <li>Some DVC basics to supplement our docs</li> </ul> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="spacy--dvc--️" style="position:relative;">SpaCy + DVC = ❤️<a href="#spacy--dvc--%EF%B8%8F" aria-label="spacy dvc ️ permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 
7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We're huge fans of a recent Python Bytes episode featuring <a href="https://twitter.com/_inesmontani" target="_blank" rel="nofollow noopener noreferrer">Ines Montani</a>, founder of Explosion and one of the makers of the incredible SpaCy library for NLP (seriously, I have the highest recommendations for SpaCy).</p> <blockquote> <p>My <a href="https://twitter.com/pythonbytes" target="_blank" rel="nofollow noopener noreferrer">@PythonBytes</a> episode is out now!</p> <p>🎙️ Listen here: <a href="https://t.co/fHLF2hR4cM" target="_blank" rel="nofollow noopener noreferrer">https://t.co/fHLF2hR4cM</a></p> <p>My picks of the week are:<br> 🐙 TextAttack by @jxmorris12: <a href="https://t.co/jySYrtzzp8" target="_blank" rel="nofollow noopener noreferrer">https://t.co/jySYrtzzp8</a><br> 🦉 Data Version Control (DVC) <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">@DVCorg</a>: <a href="https://t.co/3610F6kv8v" target="_blank" rel="nofollow noopener noreferrer">https://t.co/3610F6kv8v</a><br> 🐍 Built-in generic types in 3.9</p> <p>— Ines Montani 〰️ (@_inesmontani) <a href="https://twitter.com/_inesmontani/status/1286222512762871808" target="_blank" rel="nofollow noopener noreferrer">July 23, 2020</a></p> </blockquote> <p>Ines' episode discussed DVC, and DVC is going to be integrated with SpaCy in their 3.0 release. 
SpaCy + DVC is going to be a powerhouse and we can't wait.</p> <h3 id="take-a-stab-at-shtab" style="position:relative;">Take a stab at shtab<a href="#take-a-stab-at-shtab" aria-label="take a stab at shtab permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Another cool software project: <a href="https://cdcl.ml" target="_blank" rel="nofollow noopener noreferrer">Casper da Costa-Luis</a>, DVC contributor and creator of the popular <a href="https://github.com/tqdm/tqdm" target="_blank" rel="nofollow noopener noreferrer">tqdm library</a>, has published a tab-completion script generator for Python applications! <code>shtab</code>, as it's called, was originally designed for DVC, but Casper developed it into a generic tool that can be used for virtually any Python CLI application. 
Check out <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer"><code>shtab</code> on GitHub</a> and read the release blog.</p> <p> </p><section class="elp-content-holder"> <a href="https://dvc.org/blog/shtab-completion-release" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">(Tab) Complete Any Python Application in 1 Minute or Less</h4> <div class="elp-description">We've made a painless tab-completion script generator for Python applications!</div> <div class="elp-link">dvc.org</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-08-10/shtab-63dfef1b63f0d3983a998c2f2a37e6fe.png" alt="(Tab) Complete Any Python Application in 1 Minute or Less"> </div> </a> </section> <p></p> <h3 id="dvc-10-migration-script" style="position:relative;">DVC 1.0 migration script<a href="#dvc-10-migration-script" aria-label="dvc 10 migration script permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our friends at <a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> have released a script to help DVC users upgrade their pipelines to the new DVC 1.0 format! 
Says Simon, a DAGsHub engineer, in his tutorial:</p> <blockquote> <p>In this post, I'll walk you through the process of migrating your existing project from DVC ≤ 0.94 to DVC 1.X using a single automated script, and then demonstrate a way to check that your migration was successful.</p> </blockquote> <p>Read the blog and get migrating (but don't worry if you can't; DVC 1.0 is backwards compatible). </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/automatically-migrate-your-project-from-dvc-0-94-to-dvc-1-x-416a5b9e837b" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Automatically migrate your project from DVC≤ 0.94 to DVC 1.x</h4> <div class="elp-description">Migrating your project from DVC ≤ 0.94 to DVC 1.x can be a very involved process. Here’s an easy way to do it.</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-08-10/dagshub-d94acab82a6d235462cf66823321303b.jpg" alt="Automatically migrate your project from DVC≤ 0.94 to DVC 1.x"> </div> </a> </section> <p></p> <h3 id="recommended-reading" style="position:relative;">Recommended reading<a href="#recommended-reading" aria-label="recommended reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Here are some of our favorite blogs from around the internet 🌏.</p> <ul> <li><a href="https://deborahmesquita.com/" target="_blank" rel="nofollow noopener noreferrer">Déborah Mesquita</a>, data scientist (and an excellent writer to 
follow), published a tutorial about DVC pipelines that is truly deserving of the moniker "ultimate guide". It's a start-to-finish case study about a typical machine learning project, with DVC pipelines to automate everything from grabbing the data to training and evaluating a model. Also, it comes with a video tutorial if you prefer to watch instead of read!</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/the-ultimate-guide-to-building-maintainable-machine-learning-pipelines-using-dvc-a976907b2a1b" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">The ultimate guide to building maintainable Machine Learning pipelines using DVC</h4> <div class="elp-description">Learn the principles for building maintainable Machine Learning pipelines using DVC</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-08-10/deborah-fb09cdac9dbd7a3985fab4a3f06e83fb.jpg" alt="The ultimate guide to building maintainable Machine Learning pipelines using DVC"> </div> </a> </section> <p></p> <ul> <li>Software engineer <a href="https://www.linkedin.com/in/vaithyanathan/" target="_blank" rel="nofollow noopener noreferrer">Vaithy Narayanan</a> created the first ever ☝️ CML user blog! Vaithy created a pipeline that covers data collection to model training and testing, and used CML to automate the pipeline execution whenever the project's GitHub repository is updated. 
He ends with some insightful discussion about the strengths and weaknesses of the approach.</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/@karthik.vaithyanathan/using-continuous-machine-learning-to-run-your-ml-pipeline-eeeeacad69a3" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Using Continuous Machine Learning to Run Your ML Pipeline</h4> <div class="elp-description">Vaithy Narayanan</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-08-10/vaithy-12db755ef9cb1d18c60fe6d502f8f454.jpg" alt="Using Continuous Machine Learning to Run Your ML Pipeline"> </div> </a> </section> <p></p> <ul> <li> <p><a href="https://www.linkedin.com/in/ryan-w-gross/" target="_blank" rel="nofollow noopener noreferrer">Ryan Gross</a>, a VP at Pariveda Solutions, blogged about the future of data governance and the lessons from DevOps that might save the day. 
Honestly, you should probably start reading for this cover image alone.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/d1678417f2c4f696d8be116ddab483b4/39600/dataops.png" alt="dataops" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>DataOps is accurately depicted as a badass flaming eagle.</em> Check out the blog here:</p> </li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">The Rise of DataOps (from the ashes of Data Governance)</h4> <div class="elp-description">Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline to drive orders-of-magnitude improvements.</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-08-10/ryan-e96231d9b6f89548cf406226f82782a8.png" alt="The Rise of DataOps (from the ashes of Data Governance)"> </div> </a> </section> <p></p> <p>And, there's a <a href="https://locallyoptimistic.com/post/git-for-data-not-a-silver-bullet/?utm_campaign=Data_Elixir&utm_source=Data_Elixir_298" target="_blank" rel="nofollow noopener noreferrer">noteworthy counterpoint</a> by <a href="https://www.linkedin.com/in/michael-the-data-guy-kaminsky/" target="_blank" rel="nofollow noopener noreferrer">Michael Kaminsky</a>. Read them both!</p> <p>Thanks everyone, that's it for this month. 
We hope you're staying safe and making cool things!</p> <p><img src="https://media.giphy.com/media/35EsMpEfGHkVoHbNTU/giphy.gif" alt="Reaction GIF by MOODMAN"></p>https://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpushttps://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpusFri, 07 Aug 2020 00:00:00 GMT<p>When creating your CI/CD workflow for a machine learning (ML) project, you might find that by default, neither GitHub Actions nor GitLab CI provides the computing capabilities you need, like GPUs, high-memory instances, or multiple cores.</p> <p>To overcome this hardware hurdle, one practical approach is to use self-hosted runners: runners that you manage but that are accessible to your CI/CD system for executing jobs. It could be an EC2 instance or the GPU under your desk. In our <a href="https://dvc.org/blog/cml-release" target="_blank" rel="nofollow noopener noreferrer">recently-released project</a>, Continuous Machine Learning (CML), our Docker image acts as a thin wrapper over GitLab and GitHub runners, adding some extra capabilities.</p> <p>Here are some benefits of using CML with a self-hosted runner:</p> <ol> <li> <p><strong>Easy to use.</strong> It works the same way for both GitLab and GitHub.</p> </li> <li> <p><strong>Get out of dependency hell.</strong> We tend to install packages (on top of packages, on top of packages…) while we’re experimenting with models. In ML in particular, we can be dependent on drivers AND libraries, and sometimes precise versions of them (CUDA and TensorFlow, anyone?). 
Your CI workflow will install all the dependencies in the containerised runner, leaving your machine clean.</p> </li> <li> <p><strong>Security.</strong> If your repo is public, your runners could be accessed by anyone, who could then add <a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security-with-public-repositories" target="_blank" rel="nofollow noopener noreferrer">scripts that exploit your machine</a>. With the containerised runner, you restrict access to your real machine.</p> </li> <li> <p><strong>Gain reproducibility.</strong> One of the biggest technical debts in the ML space is reproducibility. A few weeks post-experiment, we often discover that getting our model back in shape is a pain. Looking at our repo, it’s not obvious what data or training infrastructure or dependencies went into a given result. When you move your ML experiments into a CI/CD system, you are making a contract of the dependencies and hardware used for your experiment. 
With that contract isolated by the containerised runner, your experiment can be reproduced by anyone in the future.</p> </li> </ol> <h2 id="hands-on-gpu-self-hosted-runners-101" style="position:relative;">Hands-on GPU self-hosted runners 101<a href="#hands-on-gpu-self-hosted-runners-101" aria-label="hands on gpu self hosted runners 101 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="1-install-nvidia-drivers-and-nvidia-docker-in-your-machine-ubuntu-1804" style="position:relative;">1) Install NVIDIA drivers and nvidia-docker on your machine (Ubuntu 18.04)<a href="#1-install-nvidia-drivers-and-nvidia-docker-in-your-machine-ubuntu-1804" aria-label="1 install nvidia drivers and nvidia docker in your machine ubuntu 1804 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">curl</span> <span class="token parameter variable">-s</span> <span class="token parameter variable">-L</span> https://nvidia.github.io/nvidia-docker/gpgkey <span class="token 
operator">|</span> <span class="token function">sudo</span> apt-key <span class="token function">add</span> - <span class="token operator">&&</span> <span class="token punctuation">\</span> <span class="token function">curl</span> <span class="token parameter variable">-s</span> <span class="token parameter variable">-L</span> https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list <span class="token operator">|</span> <span class="token function">sudo</span> <span class="token function">tee</span> /etc/apt/sources.list.d/nvidia-docker.list <span class="token operator">&&</span> <span class="token punctuation">\</span> <span class="token function">sudo</span> <span class="token function">apt</span> update <span class="token operator">&&</span> <span class="token function">sudo</span> <span class="token function">apt</span> <span class="token function">install</span> <span class="token parameter variable">-y</span> ubuntu-drivers-common <span class="token operator">&&</span> <span class="token punctuation">\</span> <span class="token function">sudo</span> ubuntu-drivers autoinstall <span class="token operator">&&</span> <span class="token punctuation">\</span> <span class="token function">sudo</span> <span class="token function">apt</span> <span class="token function">install</span> <span class="token parameter variable">-y</span> nvidia-container-toolkit <span class="token operator">&&</span> <span class="token punctuation">\</span> <span class="token function">sudo</span> systemctl restart <span class="token function">docker</span></span></code></pre></div> <p>You can test that your GPUs are up and running with the following command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">docker</span> run <span class="token parameter variable">--gpus</span> all iterativeai/cml:0-dvc2-base1-gpu 
nvidia-smi</span></code></pre></div> <p>We should see something like this: <span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 594px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9ba66893a70af142402136bb0861e501/39600/nvidia-smi-output.png" alt="nvidia smi output" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="2-start-your-self-hosted-runner" style="position:relative;">2) Start your self-hosted runner<a href="#2-start-your-self-hosted-runner" aria-label="2 start your self hosted runner permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>With the CML Docker images, launching your own self-hosted runner is very easy. These images have CML and DVC preinstalled (among other perks), plus CUDA drivers. That's all. 
You can clone these images and add your own dependencies to better mimic your own production environment.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">docker</span> run <span class="token parameter variable">--name</span> myrunner <span class="token parameter variable">-d</span> <span class="token parameter variable">--gpus</span> all <span class="token punctuation">\</span> <span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_IDLE_TIMEOUT</span><span class="token operator">=</span><span class="token number">1800</span> <span class="token punctuation">\</span> <span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_LABELS</span><span class="token operator">=</span>cml,gpu <span class="token punctuation">\</span> <span class="token parameter variable">-e</span> <span class="token assign-left variable">RUNNER_REPO</span><span class="token operator">=</span><span class="token variable">$my_repo_url</span> <span class="token punctuation">\</span> <span class="token parameter variable">-e</span> <span class="token assign-left variable">repo_token</span><span class="token operator">=</span><span class="token variable">$my_repo_token</span> <span class="token punctuation">\</span> iterativeai/cml:0-dvc2-base1-gpu runner</span></code></pre></div> <p>where:</p> <p><code>RUNNER_IDLE_TIMEOUT</code> is the maximum time, in seconds, that the runner stays idle waiting for jobs. If none arrive, the runner shuts down and unregisters from your repo.</p> <p><code>RUNNER_LABELS</code> is a comma-delimited list of labels, matching the ones set in our workflow, that the jobs will wait for.</p> <p><code>RUNNER_REPO</code> is the URL of your GitLab or GitHub repo.</p> <p><code>repo_token</code> is the personal token generated for your GitHub or GitLab repo. 
Note that for GitHub you must check <code>workflow</code> along with <code>repo</code>.</p> <p>If everything went fine we should see a runner registered in our repo.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0f4c6f8d9921fd73fe754a73ae76b04e/39600/registered-cml-runner-github.png" alt="registered cml runner github" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 459px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bff63d0fe6b853a4c80d71ec496f5b4a/39600/registered-cml-runner-gitlab.png" alt="registered cml runner gitlab" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="3-setup-your-github-actions-or-gitlab-workflow-yaml-file-to-use-the-runner-and-commit-your-changes" style="position:relative;">3) Setup your GitHub Actions or GitLab workflow yaml file to use the runner and commit your changes.<a href="#3-setup-your-github-actions-or-gitlab-workflow-yaml-file-to-use-the-runner-and-commit-your-changes" aria-label="3 setup your github actions or gitlab workflow yaml file to use the runner and commit your changes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>GitLab</p> <div class="gatsby-highlight" data-language="yaml"><pre 
class="language-yaml"><code class="language-yaml"><span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">tags</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> cml <span class="token punctuation">-</span> gpu <span class="token key atrule">script</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> echo 'Hi from CML<span class="token tag">!'</span> <span class="token punctuation">></span><span class="token punctuation">></span> report.md <span class="token punctuation">-</span> cml send<span class="token punctuation">-</span>comment report.md</code></pre></div> <p>GitHub</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> train<span class="token punctuation">-</span>my<span class="token punctuation">-</span>model <span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span> <span class="token key atrule">jobs</span><span class="token punctuation">:</span> <span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>self<span class="token punctuation">-</span>hosted<span class="token punctuation">,</span> cml<span class="token punctuation">,</span> gpu<span class="token punctuation">]</span> <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> cml_run 
<span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> echo 'Hi from CML!' >> report.md cml send-comment report.md</span></code></pre></div> <p>Congrats! At this point you have done all the steps to have your GPUs up and running with CML.</p> <h1 id="limitations-and-future-directions" style="position:relative;">Limitations and future directions<a href="#limitations-and-future-directions" aria-label="limitations and future directions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>There are still some limitations to be solved at this stage:</p> <ul> <li> <p>GitHub Actions <a href="https://docs.github.com/en/actions/getting-started-with-github-actions/about-github-actions#usage-limits" target="_blank" rel="nofollow noopener noreferrer">can’t run a workflow longer than 72 hours</a>.</p> </li> <li> <p>Self-hosted runners <a href="https://gitlab.com/gitlab-org/gitlab/-/issues/229851#note_390371734" target="_blank" rel="nofollow noopener noreferrer">don’t behave well when they disconnect from the repo</a>, limiting the possibilities with preemptible instances (also known as spot instances).</p> </li> </ul> <p>We’re working on these issues (see issues <a href="https://github.com/iterative/cml/issues/161" target="_blank" rel="nofollow noopener noreferrer">#161</a>, <a href="https://github.com/iterative/cml/issues/174" target="_blank" rel="nofollow noopener noreferrer">#174</a>, and <a href="https://github.com/iterative/cml/issues/208" target="_blank" 
rel="nofollow noopener noreferrer">#208</a>) both in terms of CML and DVC capabilities. So keep watching this space for updates!</p> <hr> <p>We started CML to help teams deal with the complexity of ML more effectively- continuous integration is a proven approach to keeping projects agile even as the team size, number of experiments, and number of dependencies increase. Treating experiments like potential new features in a software project opens up many possibilities for improving our engineering practices. We’re looking forward to an era when ML experiments can be created, logged, and merged into production-ready code in minutes, not days or weeks.</p>https://dvc.org/blog/july-20-community-gemshttps://dvc.org/blog/july-20-community-gemsFri, 31 Jul 2020 00:00:00 GMT<p>Here are some of our top Q&A's from around the community. With the launch of <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a> earlier in the month, we've got some new ground to cover!</p> <h2 id="dvc-questions" style="position:relative;">DVC questions<a href="#dvc-questions" aria-label="dvc questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-recently-i-set-up-a-global-dvc-remote-where-can-i-find-the-config-file" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/717673618217238598" target="_blank" rel="nofollow noopener noreferrer">Q: Recently, I set up a global DVC remote. 
Where can I find the config file?</a><a href="#q-recently-i-set-up-a-global-dvc-remote-where-can-i-find-the-config-file" aria-label="q recently i set up a global dvc remote where can i find the config file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>When you <a href="https://dvc.org/doc/command-reference/remote/list#options" target="_blank" rel="nofollow noopener noreferrer">create a global DVC remote</a>, a config file will be created in <code>~/.config/dvc/config</code> instead of your project directory (i.e., <code>.dvc/config</code>).</p> <p>Note that on a Windows system, the config file will be created at <code>C:\Users\<username>\AppData\Local\iterative\dvc\config</code>.</p> <h3 id="q-im-working-on-a-collaborative-project-and-i-use-dvc-pull-to-sync-my-local-workspace-with-the-project-repository-then-i-try-running-dvc-repro-but-get-an-error-dvcyaml-does-not-exist-no-one-else-on-my-team-is-having-this-issue-any-ideas" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/731188065078345799" target="_blank" rel="nofollow noopener noreferrer">Q: I'm working on a collaborative project, and I use <code>dvc pull</code> to sync my local workspace with the project repository. Then, I try running <code>dvc repro</code>, but get an error: <code>dvc.yaml does not exist</code>. No one else on my team is having this issue. 
Any ideas?</a><a href="#q-im-working-on-a-collaborative-project-and-i-use-dvc-pull-to-sync-my-local-workspace-with-the-project-repository-then-i-try-running-dvc-repro-but-get-an-error-dvcyaml-does-not-exist-no-one-else-on-my-team-is-having-this-issue-any-ideas" aria-label="q im working on a collaborative project and i use dvc pull to sync my local workspace with the project repository then i try running dvc repro but get an error dvcyaml does not exist no one else on my team is having this issue any ideas permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This error suggests there is no <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file in your project. Most likely, this means your teammates are using DVC version 0.94 or earlier, before the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> standard was introduced. Meanwhile, it sounds like you're using version 1.0 or later. You can check by running</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc version</span></span></code></pre></div> <p>The best solution is for your whole team to upgrade to the latest version- and there's an easy <a href="https://towardsdatascience.com/automatically-migrate-your-project-from-dvc-0-94-to-dvc-1-x-416a5b9e837b" target="_blank" rel="nofollow noopener noreferrer">migration script to help you make the move</a>. 
If for some reason this won't work for your team, you can either downgrade to a previous version, or use a workaround:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> <span class="token operator"><</span>.dvc file<span class="token operator">></span></span></code></pre></div> <p>substituting the appropriate <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file for your pipeline. DVC 1.0 is backwards compatible, so pipelines created with previous versions will still run.</p> <h3 id="q-does-the-dvc-installer-for-windows-also-include-the-dependencies-for-using-cloud-storage-like-s3-and-gcp" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/715717911574216735" target="_blank" rel="nofollow noopener noreferrer">Q: Does the DVC installer for Windows also include the dependencies for using cloud storage, like S3 and GCP?</a><a href="#q-does-the-dvc-installer-for-windows-also-include-the-dependencies-for-using-cloud-storage-like-s3-and-gcp" aria-label="q does the dvc installer for windows also include the dependencies for using cloud storage like s3 and gcp permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you're installing DVC from a binary (such as the <code>dvc.exe</code> <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">downloadable on the DVC homepage</a>), all the standard 
dependencies are included. You shouldn't need to use <code>pip</code> to install extra packages (like <code>boto</code> for S3 storage).</p> <h3 id="q-is-there-a-way-to-setup-my-dvc-remote-so-i-can-manually-download-files-from-it-without-going-through-dvc" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/717458695709130764" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to setup my DVC remote so I can manually download files from it without going through DVC?</a><a href="#q-is-there-a-way-to-setup-my-dvc-remote-so-i-can-manually-download-files-from-it-without-going-through-dvc" aria-label="q is there a way to setup my dvc remote so i can manually download files from it without going through dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>When DVC adds a file to a remote repository (such as an S3 bucket, or an SSH file server), there's only one change happening: DVC calculates an md5 for the file and renames it with that md5. In technical terms, it's storing files in a "content-addressable way". 
That means if you know the hash of a file, you can locate it in your DVC remote and manually download it.</p> <p>To find the hash for a given file, say <code>data.csv</code>, you can look in the corresponding DVC file:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data.csv.dvc</span></code></pre></div> <p>Another approach is using a built-in DVC command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token parameter variable">--show-url</span> <span class="token builtin class-name">.</span> data.csv</span></code></pre></div> <p>You can read more about <a href="https://dvc.org/doc/command-reference/get#--show-url"><code>dvc get --show-url</code></a> in <a href="https://dvc.org/doc/command-reference/get#options" target="_blank" rel="nofollow noopener noreferrer">our docs</a>. Note that this functionality is also part of our Python API, so you can locate the path to a file in your remote within a Python environment. <a href="https://dvc.org/doc/api-reference/get_url" target="_blank" rel="nofollow noopener noreferrer">Check out our API docs!</a></p> <h3 id="q-by-default-each-dvc-project-has-its-own-cache-in-the-project-repository-to-save-space-im-thinking-about-locally-creating-a-single-cache-folder-and-letting-multiple-project-repositories-point-there-will-this-work" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/563406153334128681/736164141701791815" target="_blank" rel="nofollow noopener noreferrer">Q: By default, each DVC project has its own cache in the project repository. To save space, I'm thinking about locally creating a single cache folder and letting multiple project repositories point there. 
Will this work?</a><a href="#q-by-default-each-dvc-project-has-its-own-cache-in-the-project-repository-to-save-space-im-thinking-about-locally-creating-a-single-cache-folder-and-letting-multiple-project-repositories-point-there-will-this-work" aria-label="q by default each dvc project has its own cache in the project repository to save space im thinking about locally creating a single cache folder and letting multiple project repositories point there will this work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, we hear from many users who have created a <a href="https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache#configure-the-shared-cache" target="_blank" rel="nofollow noopener noreferrer">shared cache</a>. Because of the way DVC uses content-addressable filenames, you won't encounter issues like accidentally overwriting files from one project with another.</p> <p>A possible issue is that a shared cache will grant all teammates working on a given project access to the data from all other projects using that cache. 
If you have sensitive data, you can create different caches for projects involving private and public data.</p> <p>To learn more about setting your cache directory location, <a href="https://dvc.org/doc/command-reference/cache/dir" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>.</p> <h2 id="cml-questions" style="position:relative;">CML questions<a href="#cml-questions" aria-label="cml questions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="q-i-use-bitbucket-will-cml-work-for-me" style="position:relative;">Q: I use Bitbucket. Will CML work for me?<a href="#q-i-use-bitbucket-will-cml-work-for-me" aria-label="q i use bitbucket will cml work for me permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The first release of CML is compatible with GitHub and GitLab. We've seen <a href="https://github.com/iterative/cml/issues/140" target="_blank" rel="nofollow noopener noreferrer">many requests for Bitbucket support</a>, and we're actively investigating how to add this. 
Stay tuned.</p> <h3 id="q-i-have-on-premise-gpus-can-cml-use-them-to-execute-pipelines" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/730070747388706867" target="_blank" rel="nofollow noopener noreferrer">Q: I have on-premise GPUs. Can CML use them to execute pipelines?</a><a href="#q-i-have-on-premise-gpus-can-cml-use-them-to-execute-pipelines" aria-label="q i have on premise gpus can cml use them to execute pipelines permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yep! You can use on-premise compute resources by configuring them as self-hosted runners. See <a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners" target="_blank" rel="nofollow noopener noreferrer">GitHub</a> and <a href="https://docs.gitlab.com/runner/" target="_blank" rel="nofollow noopener noreferrer">GitLab</a>'s official docs for more details and setup instructions.</p> <h3 id="q-im-building-a-workflow-that-deploys-a-gcp-compute-engine-instance-but-i-can-only-find-examples-with-aws-ec2-in-the-cml-docs-what-do-i-do" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/730688592787275806" target="_blank" rel="nofollow noopener noreferrer">Q: I'm building a workflow that deploys a GCP Compute Engine instance, but I can only find examples with AWS EC2 in the CML docs. 
What do I do?</a><a href="#q-im-building-a-workflow-that-deploys-a-gcp-compute-engine-instance-but-i-can-only-find-examples-with-aws-ec2-in-the-cml-docs-what-do-i-do" aria-label="q im building a workflow that deploys a gcp compute engine instance but i can only find examples with aws ec2 in the cml docs what do i do permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There is a slight difference in the way CML handles credentials for AWS and GCP, and that means you'll have to modify your workflow file slightly. We've added an example workflow for GCP to our <a href="https://github.com/iterative/cml#allocating-cloud-resources-with-cml" target="_blank" rel="nofollow noopener noreferrer">project README</a>.</p> <p>We've updated our <a href="https://github.com/iterative/cml_cloud_case#using-a-different-cloud-service" target="_blank" rel="nofollow noopener noreferrer">cloud compute use case repository docs</a> to cover a GCP example.</p> <p>Note that for Azure, the workflow will be the same as for AWS. You'll only have to change the arguments to <code>docker-machine</code>.</p> <h3 id="q-i-dont-see-any-installation-instructions-in-the-cml-docs-am-i-missing-something" style="position:relative;"><a href="https://discordapp.com/channels/485586884165107732/728693131557732403/733659483758133269" target="_blank" rel="nofollow noopener noreferrer">Q: I don't see any installation instructions in the CML docs. 
Am I missing something?</a><a href="#q-i-dont-see-any-installation-instructions-in-the-cml-docs-am-i-missing-something" aria-label="q i dont see any installation instructions in the cml docs am i missing something permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Nope, there's no installation unless you wish to install CML in your own Docker image. As long as you are using GitHub Actions or GitLab CI with the CML Docker images, no other steps are needed.</p> <p>If you're creating your own Docker image to be used in a GitHub Action or GitLab CI pipeline, you can add CML to your image via npm:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">$ <span class="token function">npm</span> i <span class="token parameter variable">-g</span> @dvcorg/cml</code></pre></div> <h3 id="q-can-i-use-cml-with-mlflow" style="position:relative;"><a href="https://www.youtube.com/watch?v=9BgIDqAzfuA&lc=Ugw-VxQqAaqi9hmqB3t4AaABAg" target="_blank" rel="nofollow noopener noreferrer">Q: Can I use CML with MLFlow?</a><a href="#q-can-i-use-cml-with-mlflow" aria-label="q can i use cml with mlflow permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 
13 6z"></path></svg> </a></h3> <p>CML is designed to integrate with lots of tools that ML teams are already familiar with. For example, we set up a wrapper to use CML with Tensorboard, so you get a link to your Tensorboard in a PR whenever your model is training (<a href="https://github.com/iterative/cml_tensorboard_case/pull/3" target="_blank" rel="nofollow noopener noreferrer">check out the use case</a>).</p> <p>While we haven't yet tried to create a use case with MLFlow in particular, we think a similar approach could work. We could imagine using MLFlow for hyperparameter searching, for example, and then checking in your best model with Git to a CI system for evaluation in a production-like environment. CML could help you orchestrate compute resources for model evaluation in your custom environment, pulling the model and any validation data from cloud storage, and reporting the results in a PR.</p> <p>If this is something you're interested in, make an issue on our project repository to tell us more about your project and needs- that lets us know it's a priority in the community.</p> <h3 id="q-are-there-more-tutorial-videos-coming" style="position:relative;">Q: Are there more tutorial videos coming?<a href="#q-are-there-more-tutorial-videos-coming" aria-label="q are there more tutorial videos coming permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes! 
We recently launched <a href="https://dvc.org/blog/first-mlops-tutorial" target="_blank" rel="nofollow noopener noreferrer">our first CML tutorial video</a>, and a lot of folks let us know they want more. We're aiming to release a new video every week or so in the coming months. Topics will include:</p> <ul> <li>Using DVC to push and pull data from cloud storage to your CI system</li> <li>Using CML with your on-premise hardware</li> <li>Building a data dashboard in GitHub & GitLab for monitoring changes in dynamic datasets</li> <li>Provisioning cloud compute from your CI system</li> <li>Creating a custom Docker container for testing models in a production-like environment</li> </ul> <p>We really want to know what use cases, questions, and issues are most important to you. This will help us make videos that are most relevant to the community! If you have a suggestion or idea, no matter how small, we want to know. Leave a <a href="https://youtu.be/9BgIDqAzfuA" target="_blank" rel="nofollow noopener noreferrer">comment on our videos</a>, <a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">reach out on Twitter</a>, or <a href="https://discord.gg/bzA6uY7" target="_blank" rel="nofollow noopener noreferrer">ping us in Discord</a>.</p>https://dvc.org/blog/shtab-completion-releasehttps://dvc.org/blog/shtab-completion-releaseMon, 27 Jul 2020 00:00:00 GMT<p>Command line tools are powerful. 
Things like <a href="https://en.wikipedia.org/wiki/Make_(software)" target="_blank" rel="nofollow noopener noreferrer"><code>make</code></a> have manual pages spanning, well, <a href="https://www.gnu.org/software/make/manual/make.html#Options-Summary" target="_blank" rel="nofollow noopener noreferrer">pages</a>, while just the list of <a href="https://git-scm.com" target="_blank" rel="nofollow noopener noreferrer"><code>git</code></a> subcommands is longer than can fit on a standard <code>80 x 24</code> terminal screen.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> <span class="token operator"><</span>TAB<span class="token operator">></span> </span>add filter-branch rebase am format-patch reflog annotate fsck relink ... describe prco unassume --More--</code></pre></div> <p>Notice the <code>--More--</code> at the bottom? That's the joy of pagination.</p> <p>Notice the <code><TAB></code> at the top? That represents actually pressing the tab key. Ah, the joy of shell tab completion.</p> <p>Tab completion is an indispensable part of writing anything on the command-line. Personally, I can't imagine trying to <code>git co</code> (aliased to <code>git checkout</code>) a branch without <code><TAB></code> to do the heavy lifting. <a href="https://en.wikipedia.org/wiki/Letter_frequency" target="_blank" rel="nofollow noopener noreferrer">They say</a> "E" is the most common vowel, and "T" the most common consonant. 
My keyboard use probably looks more like this:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/2eb5660d4bd9f2a134149c2995edb0ce/065c3/key-frequencies.png" alt="key frequencies" title="Yes, I use vim" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>My key usage</em></p> <p>Now, there's a tool called <code>dvc</code> which is like <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">Git for data</a>. It can be viewed as a cross-platform combination of <a href="https://git-scm.com" target="_blank" rel="nofollow noopener noreferrer"><code>git</code></a> and <a href="https://en.wikipedia.org/wiki/Make_(software)" target="_blank" rel="nofollow noopener noreferrer"><code>make</code></a> designed for handling big data and multiple cloud storage repositories, as well as tracking machine learning experiments. As you can imagine, supporting that many buzzwords means it also has a large number of subcommands and options.</p> <p><em>Every time a new feature is added, maintainers and contributors have to update tab completion scripts for multiple supported shells. At best, it's a pain, and at worst, error-prone. If you've worked on maintaining CLI applications, you'll sympathise.</em></p> <p>Surely the parser code you've written is informative enough to automate tab completion? 
Surely you shouldn't have to maintain and synchronise separate tab completion scripts?</p> <p>Good news: <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer"><code>shtab</code></a> is a new tool which magically does all of this work.</p> <p>Any Python CLI application using <a href="https://docs.python.org/library/argparse" target="_blank" rel="nofollow noopener noreferrer"><code>argparse</code></a>, <a href="https://pypi.org/project/docopt" target="_blank" rel="nofollow noopener noreferrer"><code>docopt</code></a>, or <a href="https://pypi.org/project/argopt" target="_blank" rel="nofollow noopener noreferrer"><code>argopt</code></a> can have tab completion for free!</p> <p>Simply hand your parser object to <code>shtab</code> (either via the CLI or the Python API), and a tab completion script will be generated for your preferred shell. It's as easy as:</p> <ul> <li>CLI: <code>shtab --shell=bash myprogram.main.parser</code>, or</li> <li>Python API: <code>import shtab; print(shtab.complete(parser, shell="bash"))</code>.</li> </ul> <h3 id="argparse-example" style="position:relative;"><code>argparse</code> example<a href="#argparse-example" aria-label="argparse example permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Suppose you have some code in a module <code>hello.main</code>:</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> argparse

<span class="token keyword">def</span> <span class="token function">get_main_parser</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    parser <span class="token operator">=</span> argparse<span class="token punctuation">.</span>ArgumentParser<span class="token punctuation">(</span>prog<span class="token operator">=</span><span class="token string">"hello"</span><span class="token punctuation">)</span>
    parser<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span>
        <span class="token string">"who"</span><span class="token punctuation">,</span> <span class="token builtin">help</span><span class="token operator">=</span><span class="token string">"good question"</span><span class="token punctuation">,</span> nargs<span class="token operator">=</span><span class="token string">"?"</span><span class="token punctuation">,</span> default<span class="token operator">=</span><span class="token string">"world"</span><span class="token punctuation">)</span>
    parser<span class="token punctuation">.</span>add_argument<span class="token punctuation">(</span>
        <span class="token string">"--what"</span><span class="token punctuation">,</span> <span class="token builtin">help</span><span class="token operator">=</span><span class="token string">"a better question"</span><span class="token punctuation">,</span> default<span class="token operator">=</span><span class="token string">"hello"</span><span class="token punctuation">,</span>
        choices<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"hello"</span><span class="token punctuation">,</span> <span class="token string">"goodbye"</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
    <span class="token keyword">return</span> parser

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    parser <span class="token operator">=</span> get_main_parser<span class="token punctuation">(</span><span class="token punctuation">)</span>
    args <span class="token operator">=</span> parser<span class="token punctuation">.</span>parse_args<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"{}, {}!"</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>args<span class="token punctuation">.</span>what<span class="token punctuation">,</span> args<span class="token punctuation">.</span>who<span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre></div>
<p>To get tab completion for <code>bash</code>, simply install <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer"><code>shtab</code></a> and then run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">shtab <span class="token parameter variable">--shell</span><span class="token operator">=</span>bash hello.main.get_main_parser <span class="token punctuation">\</span>
  <span class="token operator">|</span> <span class="token function">sudo</span> <span class="token function">tee</span> <span class="token string">"<span class="token environment constant">$BASH_COMPLETION_COMPAT_DIR</span>"</span>/hello <span class="token operator">></span>/dev/null</code></pre></div>
<p>Zsh user? Not a problem. Simply run:</p>
<div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">shtab <span class="token parameter variable">--shell</span><span class="token operator">=</span>zsh hello.main.get_main_parser <span class="token punctuation">\</span>
  <span class="token operator">|</span> <span class="token function">sudo</span> <span class="token function">tee</span> /usr/local/share/zsh/site-functions/_hello <span class="token operator">></span>/dev/null
<span class="token comment"># note the underscore `_` prefix in the filename</span></code></pre></div>
<p>Handily, you can install <code>shtab</code>'s own completions by following the above examples, replacing <code>hello</code> with <code>shtab</code>.</p> <p><img src="https://dvc.org/2020-07-27/dvc-3857db37e1b5aeb81848451e82007f50.gif" alt=""><em><code>shtab</code>-driven <code>dvc</code> completion in <code>bash</code> and <code>zsh</code></em></p> <p>Using <code>shtab</code>, here's what <a href="https://dvc.org/doc/install/completion" target="_blank" rel="nofollow noopener noreferrer"><code>dvc</code>'s completion</a> looks like when installed:</p>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">% dvc <TAB>
Completing dvc commands
add         -- Track data files or directories with DVC.
cache       -- Manage cache settings.
checkout    -- Checkout data files from cache.
commit      -- Save changed data to cache and update DVC-files.
completion  -- Prints out shell tab completion scripts.
At Top: Hit TAB for more, or the character to insert</code></pre></div>
<p>All completion suggestions are guaranteed to be in sync with the code! The maintainers of <code>dvc</code> were very surprised to find no fewer than <a href="https://github.com/iterative/dvc/commits/main/scripts/completion" target="_blank" rel="nofollow noopener noreferrer">84 commits</a> touching their old completion scripts. 
Such churn is now a thing of the past!</p> <p>You might notice one of the subcommands provided by <code>dvc</code> is <a href="https://dvc.org/doc/install/completion" target="_blank" rel="nofollow noopener noreferrer"><code>completion</code></a>. Here's a quick example of how to provide such convenience for users:</p> <h3 id="integrating-library-example" style="position:relative;">Integrating library example<a href="#integrating-library-example" aria-label="integrating library example permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Feeling minimal? How about adding <code>import shtab</code> to your application itself for a cleaner user interface? And let's use <a href="https://pypi.org/project/argopt" target="_blank" rel="nofollow noopener noreferrer"><code>argopt</code></a> to convert <a href="https://pypi.org/project/docopt" target="_blank" rel="nofollow noopener noreferrer"><code>docopt</code></a>'s neat syntax to <code>argparse</code> while we're at it.</p>
<div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token triple-quoted-string string">"""Greetings and partings.

Usage:
  greeter [options] [<you>] [<me>]

Options:
  -g, --goodbye : Say "goodbye" (instead of "hello")
  -b, --print-bash-completion : Output a bash tab-completion script
  -z, --print-zsh-completion : Output a zsh tab-completion script

Arguments:
  <you> : Your name [default: Anon]
  <me> : My name [default: Casper]
"""</span>
<span class="token keyword">import</span> sys<span class="token punctuation">,</span> argopt<span class="token punctuation">,</span> shtab

parser <span class="token operator">=</span> argopt<span class="token punctuation">.</span>argopt<span class="token punctuation">(</span>__doc__<span class="token punctuation">)</span>
<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">"__main__"</span><span class="token punctuation">:</span>
    args <span class="token operator">=</span> parser<span class="token punctuation">.</span>parse_args<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> args<span class="token punctuation">.</span>print_bash_completion<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>shtab<span class="token punctuation">.</span>complete<span class="token punctuation">(</span>parser<span class="token punctuation">,</span> shell<span class="token operator">=</span><span class="token string">"bash"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        sys<span class="token punctuation">.</span>exit<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> args<span class="token punctuation">.</span>print_zsh_completion<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span>shtab<span class="token punctuation">.</span>complete<span class="token punctuation">(</span>parser<span class="token punctuation">,</span> shell<span class="token operator">=</span><span class="token string">"zsh"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        sys<span class="token punctuation">.</span>exit<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>

    msg <span class="token operator">=</span> <span class="token string">"k thx bai!"</span> <span class="token keyword">if</span> args<span class="token punctuation">.</span>goodbye <span class="token keyword">else</span> <span class="token string">"hai!"</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">"{} says '{}' to {}"</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>args<span class="token punctuation">.</span>me<span class="token punctuation">,</span> msg<span class="token punctuation">,</span> args<span class="token punctuation">.</span>you<span class="token punctuation">)</span><span class="token punctuation">)</span></code></pre></div>
<h3 id="try-it-out" style="position:relative;">Try it out<a href="#try-it-out" aria-label="try it out permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are many more options and features. 
The <a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer">documentation</a> includes examples of working with custom file completions and providing a <code>completion</code> subcommand when integrating more tightly with existing applications.</p> <p>Try it out with <code>pip install -U shtab</code> or <code>conda install -c conda-forge shtab</code>!</p> <p>Is it worth the time?</p> <p><img src="https://imgs.xkcd.com/comics/is_it_worth_the_time.png" alt=""><em>It's worth it <a href="https://xkcd.com/1205" target="_blank" rel="nofollow noopener noreferrer">xkcd#1205</a></em></p> <p><a href="https://github.com/iterative/shtab" target="_blank" rel="nofollow noopener noreferrer"><code>shtab</code></a> would be on the second row, far left (maybe even off grid). It's worth spending days to get right yet only takes seconds to install.</p>https://dvc.org/blog/first-mlops-tutorialhttps://dvc.org/blog/first-mlops-tutorialFri, 24 Jul 2020 00:00:00 GMT<p>Earlier this month, we launched <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">CML</a>, our latest open-source project in the MLOps space. We think it's a step towards establishing powerful DevOps practices (like continuous integration) as a regular fixture of machine learning and data science projects. But there are plenty of challenges ahead, and a big one is <em>literacy</em>.</p> <p>So many data scientists, like developers, are self-taught. Data science degrees have only recently emerged on the scene, which means if you polled a handful of senior-level data scientists, there'd almost certainly be no universal training or certificate among them. Moreover, there's still no widespread agreement about what it takes to be a data scientist: is it an engineering role with a little bit of Tensorflow sprinkled on top? A title for statisticians who can code? 
We're not expecting an easy resolution to these existential questions anytime soon.</p> <p>In the meantime, we're starting a video series to help data scientists curious about DevOps (and developers and engineers curious about data science!) get started. Through hands-on coding examples and use cases, we want to give data science practitioners the fundamentals to explore, use, and influence MLOps.</p> <p>The first video in this series uses a lightweight and fairly popular data science problem (building a model to predict wine quality ratings) as a playground to introduce continuous integration.</p> <p>The tutorial covers:</p> <ul> <li>Using Git-flow in a data science project (making a feature branch and pull request)</li> <li>Creating your first GitHub Action to train and evaluate a model</li> <li>Using CML to generate visual reports in your pull request summarizing model performance</li> </ul> <p>It's now up on YouTube!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/9BgIDqAzfuA?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p><a href="https://github.com/andronovhopf/wine" target="_blank" rel="nofollow noopener noreferrer">Code for the project is available online</a> so you can follow along! 
We also recommend checking out the <a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">CML docs</a> for more details, tutorials, and use cases.</p> <p>If you have questions, the best way to get in touch is by leaving a comment on the blog, video, or our <a href="https://discord.gg/bzA6uY7" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>. We're especially interested to hear what use cases you'd like to see covered in future videos. Tell us about your data science project and how you could imagine using continuous integration, and we might be able to create a video!</p>https://dvc.org/blog/devops-for-data-scientistshttps://dvc.org/blog/devops-for-data-scientistsThu, 16 Jul 2020 00:00:00 GMT<p>With the rapid evolution of machine learning (ML) in the last few years, it’s become <a href="https://towardsdatascience.com/deep-learning-isnt-hard-anymore-26db0d4749d7" target="_blank" rel="nofollow noopener noreferrer">trivially easy to begin ML experiments</a>. Thanks to libraries like <a href="https://scikit-learn.org/stable/" target="_blank" rel="nofollow noopener noreferrer">scikit-learn</a> and <a href="https://github.com/keras-team/keras" target="_blank" rel="nofollow noopener noreferrer">Keras</a>, you can make models with a few lines of code.</p> <p>But it’s harder than ever to turn data science projects into meaningful applications, like a model that informs team decisions or becomes part of a product. The typical ML project involves <a href="https://ieeexplore.ieee.org/abstract/document/8804457" target="_blank" rel="nofollow noopener noreferrer">so many distinct skill sets</a> that it’s challenging, if not outright impossible, for any one person to master them all. It’s so hard that the rare data scientist who can also develop quality software and play engineer is called a unicorn!</p> <p>As the field matures, a lot of jobs are going to require a mix of software, engineering, and mathematical chops. 
Some say <a href="https://www.anaconda.com/state-of-data-science-2020?utm_medium=press&utm_source=anaconda&utm_campaign=sods-2020&utm_content=report" target="_blank" rel="nofollow noopener noreferrer">they</a> <a href="http://veekaybee.github.io/2019/02/13/data-science-is-different/" target="_blank" rel="nofollow noopener noreferrer">already</a> <a href="https://tech.trivago.com/2018/12/03/teardown-rebuild-migrating-from-hive-to-pyspark/" target="_blank" rel="nofollow noopener noreferrer">do</a>.</p> <p>To quote the unparalleled data scientist/engineer/critical observer Vicki Boykis in her blog <a href="http://veekaybee.github.io/2019/02/13/data-science-is-different/" target="_blank" rel="nofollow noopener noreferrer">Data science is different now</a>:</p> <blockquote> <p>What is becoming clear is that, in the late stage of the hype cycle, data science is asymptotically moving closer to engineering, and the <a href="https://www.youtube.com/watch?v=frQeK8xo9Ls" target="_blank" rel="nofollow noopener noreferrer">skills that data scientists need</a> moving forward are less visualization and statistics-based, and <a href="https://tech.trivago.com/2018/12/03/teardown-rebuild-migrating-from-hive-to-pyspark/" target="_blank" rel="nofollow noopener noreferrer">more in line with traditional computer science curricula</a>.</p> </blockquote> <h2 id="why-data-scientists-need-to-know-about-devops" style="position:relative;">Why data scientists need to know about DevOps<a href="#why-data-scientists-need-to-know-about-devops" aria-label="why data scientists need to know about devops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 
13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>So which of the many, many engineering and software skills should data scientists learn? My money is on DevOps. DevOps, a portmanteau of development and operations, was officially born in 2009 <a href="https://en.wikipedia.org/wiki/DevOps#History" target="_blank" rel="nofollow noopener noreferrer">at a Belgian conference</a>. The meeting was convened as a response to tensions between two facets of tech organizations that historically experienced deep divisions. Software developers needed to move fast and experiment often, while Operations teams prioritized stability and availability of services (these are the people who keep servers running day in and day out). Their goals were not only opposing, they were competing.</p> <p>That sounds awfully reminiscent of today’s data science. Data scientists create value by experiments: new ways of modeling, combining, and transforming data. Meanwhile, the organizations that employ data scientists are incentivized for stability.</p> <p>The consequences of this division are profound: in the <a href="https://www.globenewswire.com/news-release/2020/06/30/2055578/0/en/Anaconda-Releases-2020-State-of-Data-Science-Survey-Results.html" target="_blank" rel="nofollow noopener noreferrer">latest Anaconda “State of Data Science” report</a>, “fewer than half (48%) of respondents feel they can demonstrate the impact of data science” on their organization. By some estimates, the vast majority of <a href="https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/" target="_blank" rel="nofollow noopener noreferrer">models created by data scientists end up stuck on a shelf</a>. We don’t yet have strong practices for passing models between the teams that create them and the teams that deploy them. 
Data scientists and the developers and engineers who implement their work have entirely different tools, constraints, and skill sets.</p> <p>DevOps emerged to combat this sort of deadlock in software, back when it was developers vs. operations. And it was tremendously successful: <a href="http://engineering.microsoft.com/devops/" target="_blank" rel="nofollow noopener noreferrer">many</a> <a href="https://insights.sei.cmu.edu/devops/2015/02/devops-case-study-amazon-aws.html" target="_blank" rel="nofollow noopener noreferrer">teams</a> have gone from deploying new code every few months to several times a day. Now that we have machine learning vs. operations, it’s time to think about MLOps — principles from DevOps that work for data science.</p> <h2 id="introducing-continuous-integration" style="position:relative;">Introducing Continuous Integration<a href="#introducing-continuous-integration" aria-label="introducing continuous integration permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>DevOps is both a philosophy and a set of practices, including:</p> <ol> <li> <p>Automate everything you can</p> </li> <li> <p>Get feedback on new ideas fast</p> </li> <li> <p>Reduce manual handoffs in your workflow</p> </li> </ol> <p>In a typical data science project, we can see some applications:</p> <ol> <li> <p><strong>Automate everything you can.</strong> Automate parts of your data processing, model training, and model testing that are repetitive and predictable.</p> </li> <li> <p><strong>Get feedback on new ideas fast.</strong> When your data, code, or software 
environment changes, test it immediately in a production-like environment (meaning, a machine with the dependencies and constraints you anticipate having in production).</p> </li> <li> <p><strong>Reduce manual handoffs in your workflow.</strong> Find opportunities for data scientists to test their own models as much as possible. Don’t wait until a developer is available to see how the model will behave in a production-like environment.</p> </li> </ol> <p>The standard DevOps approach for accomplishing these goals is a method called continuous integration (CI).</p> <p>The gist is that when you change a project’s source code (usually, changes are registered via git commits), your software is automatically built and tested. Every action triggers feedback. CI is often used with <a href="https://nvie.com/posts/a-successful-git-branching-model/" target="_blank" rel="nofollow noopener noreferrer">Git-flow</a>, a development architecture in which new features are built on Git branches (need a Git refresher? <a href="https://towardsdatascience.com/why-git-and-how-to-use-git-as-a-data-scientist-4fa2d3bdc197" target="_blank" rel="nofollow noopener noreferrer">Try this</a>). When a feature branch passes the automated tests, it becomes a candidate to be merged into the master branch.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9686e1522b8cfdc441dd2fff2c34db15/39600/basic_ci_system.png" alt="basic ci system" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Here's what continuous integration looks like in software development.</em></p> <p>With this setup, we have automation — code changes trigger an automatic build followed by testing. We have fast feedback, because we get test results back quickly, so the developer can keep iterating on their code. 
And because all this happens automatically, you don’t need to wait for anyone else to get feedback: one less handoff!</p> <p><em>So why don’t we already use continuous integration in ML?</em> Some reasons are cultural, like a low crossover between data science and software engineering communities. Others are technical: for example, to understand your model’s performance, you need to look at metrics like accuracy, specificity, and sensitivity. You might be assisted by data visualizations, like a confusion matrix or loss plot. So pass/fail tests won’t cut it for feedback. Understanding whether a model has improved requires some domain knowledge about the problem at hand, so test results need to be reported in an efficient and human-interpretable way.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c6eab1d9783382564176cf970c5956b1/39600/ci_for_data_system.png" alt="ci for data system" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Here's what continuous integration might look like in a machine learning project. Inspected by Data Science Doggy.</em></p> <h2 id="how-do-ci-systems-work" style="position:relative;">How do CI systems work?<a href="#how-do-ci-systems-work" aria-label="how do ci systems work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now we’ll get even more practical. Let’s take a look at how a typical CI system works. 
Luckily for learners, the barrier to entry has never been lower thanks to tools like GitHub Actions and GitLab CI: they have clear graphical interfaces and excellent docs geared for first-time users. Since GitHub Actions is completely free for public projects, we’ll use it for this example. It works like this:</p> <ol> <li>You create a GitHub repository. You create a directory called <code>.github/workflows</code>, and inside, you place a special <code>.yaml</code> file with a script you want to run, such as:</li> </ol>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">python</span> train.py</span></code></pre></div>
<ol start="2"> <li>You change the files in your project repository somehow and commit the change with Git. Then, push to your GitHub repository.</li> </ol>
<div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Create a new git branch for experimenting</span>
<span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> <span class="token parameter variable">-b</span> <span class="token string">"experiment"</span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">edit</span> train.py
</span>
<span class="token comment"># git add, commit, and push your changes</span>
<span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span> <span class="token operator">&&</span> <span class="token git">git commit</span> <span class="token parameter variable">-m</span> <span class="token string">"Normalized features"</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git push</span> origin experiment</span></code></pre></div>
<ol start="3"> <li> <p>As soon as GitHub detects the push, 
GitHub deploys one of its computers to run the steps defined in your <code>.yaml</code>.</p> </li> <li> <p>GitHub returns a notification reporting whether the steps ran successfully.</p> </li> </ol> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/990319ad51b933539c46b3cb7622541d/39600/run_notification.png" alt="run notification" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Find this in the Actions tab of your GitHub repository.</em></p> <p>That’s it! What’s really neat here is that you’re using GitHub’s computers to run your code. All you have to do is update your code and push the change to your repository, and the workflow happens automatically.</p> <p>Back to that special <code>.yaml</code> file I mentioned in Step 1: let’s take a quick look at one. It can have any name you like, as long as the file extension is <code>.yaml</code> (or <code>.yml</code>) and it’s stored in the directory <code>.github/workflows</code>. 
Here’s one:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># .github/workflows/ci.yaml</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> train<span class="token punctuation">-</span>my<span class="token punctuation">-</span>model <span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span> <span class="token key atrule">jobs</span><span class="token punctuation">:</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>ubuntu<span class="token punctuation">-</span>latest<span class="token punctuation">]</span> <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> training <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token punctuation">|</span><span class="token scalar string"> pip install -r requirements.txt python train.py</span></code></pre></div> <p>There’s a lot going on, but most of it is the same from Action to Action- you can pretty much copy and paste this standard GitHub Actions template, but fill in your workflow in the <code>run</code> field.</p> <p>If this file is in your project repo, whenever GitHub detects a change to your code (registered via a push), GitHub Actions will deploy an Ubuntu runner and attempt to execute your commands to install requirements and run a Python script. 
Be aware that you have to have the files required for your workflow — here, <code>requirements.txt</code> and <code>train.py</code> — in your project repo!</p> <h2 id="get-better-feedback" style="position:relative;">Get better feedback<a href="#get-better-feedback" aria-label="get better feedback permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As we alluded to earlier, automatic training is pretty cool and all, but it’s important to have the results in a format that’s easy to understand. Currently, GitHub Actions gives you access to the runner’s logs, which are plain text.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f2f16dc29109a1fecb2b327d4738b8a6/39600/github_actions_log.png" alt="github actions log" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>An example printout from a GitHub Actions log.</em></p> <p>But understanding your model’s performance is tricky. Models and data are high dimensional and often behave nonlinearly — two things that are especially hard to understand without pictures!</p> <p>I can show you one approach for putting data viz in the CI loop. For the last few months, my team at Iterative.ai has been working on a toolkit to help use GitHub Actions and GitLab CI for machine learning projects. 
It’s called <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">Continuous Machine Learning</a> (CML for short), and it’s open source and free.</p> <p>Working from the basic idea of “Let’s use GitHub Actions to train ML models,” we’ve built some functions to give more detailed reports than a pass/fail notification. CML helps you put images and tables in the reports, like this confusion matrix generated by scikit-learn:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/8e4cf67da97031a136fc7af36fee9520/39600/cml_basic_report.png" alt="cml basic report" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>This report appears when you make a Pull Request in GitHub!</em></p> <p>To make this report, our GitHub Action executed a Python model training script, and then used CML functions to write our model accuracy and confusion matrix to a markdown document. 
Then CML passed the markdown document to GitHub.</p> <p>Our revised <code>.yaml</code> file contains the following workflow:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">name</span><span class="token punctuation">:</span> train<span class="token punctuation">-</span>my<span class="token punctuation">-</span>model <span class="token key atrule">on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>push<span class="token punctuation">]</span> <span class="token key atrule">jobs</span><span class="token punctuation">:</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span class="token key atrule">runs-on</span><span class="token punctuation">:</span> <span class="token punctuation">[</span>ubuntu<span class="token punctuation">-</span>latest<span class="token punctuation">]</span> <span class="token key atrule">container</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1 <span class="token key atrule">steps</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> <span class="token key atrule">uses</span><span class="token punctuation">:</span> actions/checkout@v2 <span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> training <span class="token key atrule">env</span><span class="token punctuation">:</span> <span class="token key atrule">repo_token</span><span class="token punctuation">:</span> $<span class="token punctuation">{</span><span class="token punctuation">{</span> secrets.GITHUB_TOKEN <span class="token punctuation">}</span><span class="token punctuation">}</span> <span class="token key atrule">run</span><span class="token punctuation">:</span> <span 
class="token punctuation">|</span><span class="token scalar string"> # train.py outputs metrics.txt and plot.png pip3 install -r requirements.txt python train.py</span> <span class="token comment"># copy the contents of metrics.txt to our markdown report</span> cat metrics.txt <span class="token punctuation">></span><span class="token punctuation">></span> report.md <span class="token comment"># add our confusion matrix to report.md</span> cml publish plot.png <span class="token punctuation">-</span><span class="token punctuation">-</span>md <span class="token punctuation">></span><span class="token punctuation">></span> report.md <span class="token comment"># send the report to GitHub for display</span> cml send<span class="token punctuation">-</span>comment report.md</code></pre></div> <p>You can see the entire <a href="https://github.com/iterative/cml_base_case" target="_blank" rel="nofollow noopener noreferrer">project repository here</a>. Note that our .yaml now contains a few more configuration details, like a special Docker container and an environment variable, plus some new code to run. The container and environment variable details are standard in every CML project, not something the user needs to manipulate, so focus on the code!</p> <p>With the addition of these CML functions to the workflow, we’ve created a more complete feedback loop in our CI system:</p> <ol> <li> <p>Make a Git branch and change your code on that branch.</p> </li> <li> <p>Automatically train the model and produce metrics (accuracy) and a visualization (confusion matrix).</p> </li> <li> <p>Embed those results in a visual report in your Pull Request.</p> </li> </ol> <p>Now, when you and your teammates are deciding if your changes have a positive effect on your modeling goals, you have a dashboard of sorts to review. Plus, this report is linked by Git to your exact project version (data and code) AND the runner used for training AND the logs from that run. Very thorough! 
No more graphs floating around your workspace that have long ago lost any connection to your code!</p> <p>So that’s the basic idea of CI in a data science project. To be clear, this example is among the simplest ways to work with CI. In real life, you’ll likely encounter considerably more complex scenarios. CML also has features to help you use large datasets stored outside your GitHub repository (using DVC) and train on cloud instances, instead of the default GitHub Actions runners. That means you can use GPUs and other specialized setups.</p> <p>For example, I made a project using GitHub Actions to deploy an <a href="https://github.com/iterative/cml_cloud_case" target="_blank" rel="nofollow noopener noreferrer">EC2 GPU and then train a neural style transfer model</a>. Here’s my CML report:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/cd248dcfa2d85a511c3c095948ed83c9/39600/cloud_report.png" alt="cloud report" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Training in the cloud! Weeeeeee!</em></p> <p>You can also use your own Docker containers, so you can closely emulate the environment of a model in production. 
I’ll be blogging more about these advanced use cases in the future.</p> <h2 id="final-thoughts-on-ci-for-ml" style="position:relative;">Final thoughts on CI for ML<a href="#final-thoughts-on-ci-for-ml" aria-label="final thoughts on ci for ml permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To summarize what we’ve said so far:</p> <p><strong>DevOps is not a specific technology, but a philosophy and a set of principles and practices for fundamentally restructuring the process of creating software.</strong> It’s effective because it <strong>addresses systemic bottlenecks</strong> in how teams work and experiment with new code.</p> <p>As data science matures in the coming years, people who understand how to apply DevOps principles to their machine learning projects will be a valuable commodity — both in terms of salary and their organizational impact. Continuous integration is a staple of DevOps and one of the most effective known methods for building a culture with reliable automation, fast testing, and autonomy for teams.</p> <p>CI can be implemented with systems like <a href="https://github.com/features/actions" target="_blank" rel="nofollow noopener noreferrer">GitHub Actions</a> or <a href="https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/" target="_blank" rel="nofollow noopener noreferrer">GitLab CI</a>, and you can use these services to build automatic model training systems. 
The benefits are numerous:</p> <ol> <li> <p>Your code, data, models, and training infrastructure (hardware and software environment) are Git versioned.</p> </li> <li> <p>You’re automating work, testing frequently and getting fast feedback (with visual reports if you use CML). In the long run, this will almost certainly speed up your project’s development.</p> </li> <li> <p>CI systems make your work visible to everyone on your team. No one has to search very hard to find the code, data, and model from your best run.</p> </li> </ol> <p>And I promise, once you get into the groove, it is incredibly fun to have your model training, recording, and reporting automatically kicked off by a single git commit.</p> <p>You will feel so cool.</p> <p><img src="https://media.giphy.com/media/26AHG5KGFxSkUWw1i/giphy.gif" alt="Pixel Illustration GIF by Walter Newton"></p> <h3 id="further-reading" style="position:relative;">Further reading<a href="#further-reading" aria-label="further reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li> <p><a href="https://www.martinfowler.com/articles/continuousIntegration.html" target="_blank" rel="nofollow noopener noreferrer">Continuous Integration</a>, the seminal Martin Fowler blog on the subject</p> </li> <li> <p><a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">Continuous Delivery for Machine Learning</a>, another excellent blog on Martin Fowler’s site about building a continuous integration & continuous delivery system for ML</p> </li> <li> <p><a 
href="https://www.amazon.com/DevOps-Handbook-Second-World-Class-Organizations/dp/B09L56CT6N" target="_blank" rel="nofollow noopener noreferrer">The DevOps Handbook</a>, a beloved guide that is recommended for nearly any organization (ML, software, or not)</p> </li> </ul> <p><em><strong>Note:</strong> This article has been cross-posted on Medium.</em></p>https://dvc.org/blog/july-20-dvc-heartbeathttps://dvc.org/blog/july-20-dvc-heartbeatFri, 10 Jul 2020 00:00:00 GMT<p>Welcome to the July Heartbeat, our monthly roundup of <a href="#news">new releases</a>, <a href="#community-activity">talks</a>, <a href="#good-reads">great articles</a>, and <a href="#coming-up-soon">upcoming events</a> in the DVC community.</p> <h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="dvc-10-release" style="position:relative;">DVC 1.0 release<a href="#dvc-10-release" aria-label="dvc 10 release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>On June 22, DVC entered a new era: the <a href="https://dvc.org/blog/dvc-1-0-release" target="_blank" rel="nofollow noopener noreferrer">official 
release of version 1.0</a>. After several weeks of bug-catching with our pre-release, the team has issued DVC 1.0 for the public! Now when you <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">install DVC through your package manager of choice</a>, you'll get the latest version. Welcome to the future.</p> <p>To recap, DVC 1.0 has some big new features like:</p> <ul> <li>Plots powered by Vega-Lite so you can compare metrics across commits</li> <li>New and easier pipeline configuration files- edit your DVC pipeline like a text file!</li> <li>Optimizations for data transfer speed</li> </ul> <p>Read all the <a href="https://dvc.org/blog/dvc-1-0-release" target="_blank" rel="nofollow noopener noreferrer">release notes</a> for more, and stop by our <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord</a> if you need support migrating (don't worry, 1.0 is backwards compatible).</p> <h3 id="virtual-meetup" style="position:relative;">Virtual meetup!<a href="#virtual-meetup" aria-label="virtual meetup permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In May, we had our <a href="https://dvc.org/blog/may-20-dvc-heartbeat">first ever virtual meetup</a>. 
We had amazing talks from <a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a> and <a href="https://github.com/ehutt" target="_blank" rel="nofollow noopener noreferrer">Elizabeth Hutton</a>, plus time for Q&A with the DVC team- you can <a href="https://www.youtube.com/watch?v=19GMtrFykSU&list=PLVeJCYrrCemiOc1SS_PIB3Tb3HX0Aqw3j" target="_blank" rel="nofollow noopener noreferrer">watch the recording</a> if you missed it!</p> <p>On Thursday, July 30, we're hosting our second meetup! Ambassador <a href="http://mribeirodantas.me/" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a> is hosting once again. We'll have short talks about causal modeling and CI/CD, plus lots of time for chatting and catching up. Please RSVP!</p> <blockquote class="embedly-card"><h4><a href="https://www.meetup.com/DVC-Community-Virtual-Meetups/events/271844501/">July DVC Meetup: Data Science & DevOps!</a></h4><p>This meetup will be hosted by DVC Ambassador Marcel! 
AGENDA:We have two 10-minute talks on the agenda:- Causal Modeling with DVC - Marcel- Continuous integration for ML case studies - Elle Following talks, we'll have Q&A with the DVC team and time for community discussion.</p></blockquote> <script async src="//cdn.embedly.com/widgets/platform.js" charset="UTF-8"></script> <h3 id="dvc-is-in-the-top-20-fastest-growing-open-source-startups" style="position:relative;">DVC is in the top 20 fastest-growing open source startups<a href="#dvc-is-in-the-top-20-fastest-growing-open-source-startups" aria-label="dvc is in the top 20 fastest growing open source startups permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Konstantin Vinogradov at <a href="https://runacap.com/" target="_blank" rel="nofollow noopener noreferrer">Runa Capital</a> used the GitHub API to <a href="https://medium.com/runacapital/open-source-growth-benchmarks-and-the-20-fastest-growing-oss-startups-d3556a669fe6" target="_blank" rel="nofollow noopener noreferrer">identify the fastest growing public repositories on GitHub</a> in terms of stars and forks. He used these metrics to estimate the top 20 fastest growing startups in open source software. And guess what, DVC made the cut! 
We're in great company.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f0727e088aa19b3e291c39749796bcf5/39600/top20startups.png" alt="top20startups" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="new-team-member" style="position:relative;">New team member<a href="#new-team-member" aria-label="new team member permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We have a new teammate-<a href="https://www.linkedin.com/in/mvshmakov/" target="_blank" rel="nofollow noopener noreferrer">Maxim Shmakov</a>, previously of Yandex, is joining us! Maxim is a front-end engineer joining us from Moscow. Please welcome him to DVC. 👋</p> <h2 id="community-activity" style="position:relative;">Community activity<a href="#community-activity" aria-label="community activity permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We've been busy! Although we are mostly homebound these days, there has been no shortage of speaking engagements. 
Here's a recap.</p> <h3 id="meetings-and-talks" style="position:relative;">Meetings and talks<a href="#meetings-and-talks" aria-label="meetings and talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li>Co-founders Dmitry and Ivan appeared on the HasGeek TV series <a href="https://hasgeek.com/fifthelephant/making-data-science-work-session-3/" target="_blank" rel="nofollow noopener noreferrer">Making Data Science Work</a> to discuss engineering for data science with hosts <a href="https://www.linkedin.com/in/pingali/" target="_blank" rel="nofollow noopener noreferrer">Venkata Pingali</a> and <a href="https://www.linkedin.com/in/indrayudhghoshal/" target="_blank" rel="nofollow noopener noreferrer">Indrayudh Ghoshal</a>. 
The livestream is available for viewing on YouTube!</li> </ul> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/EWcpALbzZRg?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <ul> <li>Dmitry appeared on the <a href="https://mlops.community/" target="_blank" rel="nofollow noopener noreferrer">MLOps.community</a> meetup to chat with host <a href="https://www.linkedin.com/in/dpbrinkm/" target="_blank" rel="nofollow noopener noreferrer">Demetrios Brinkmann</a>. They talked about the open source ecosystem, the difference between tools and platforms, and what it means to codify data.</li> </ul> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/ojV1tK9jXH8?rel=0&%3B=&%3Bshowinfo=0%3B&start=2295" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <ul> <li>I (Elle) gave a talk at the <a href="https://mlopsworld.com/" target="_blank" rel="nofollow noopener noreferrer">MLOps Production & Engineering World</a> meeting, called "Adapting continuous 
integration and continuous delivery for ML". I shared an approach to using GitHub Actions with ML projects. Video coming soon!</li> </ul> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Elle O'Brien is currently explaining the adaptation of continuous integration and continuous delivery for ML at <a href="https://twitter.com/hashtag/MLOPS2020?src=hash&ref_src=twsrc%5Etfw">#MLOPS2020</a>!<br><br>From explaining DVC to providing great examples - a very interesting talk with @andronovhopf taking place right now! <a href="https://t.co/dJjuLb0k4F">pic.twitter.com/dJjuLb0k4F</a></p>— Toronto Machine Learning Society (TMLS) (@TMLS_TO) <a href="https://twitter.com/TMLS_TO/status/1273693487104503808">June 18, 2020</a></blockquote> <ul> <li>Extremely early the next morning, clinician-scientist <a href="https://www.linkedin.com/in/crislanting/?originalSubdomain=nl" target="_blank" rel="nofollow noopener noreferrer">Cris Lanting</a> and I co-led a workshop about developing strong computational infrastructure and practices in research as part of the <a href="https://computationalaudiology.com/" target="_blank" rel="nofollow noopener noreferrer">Virtual Conference on Computational Audiology</a>. We talked about big ideas for making scientific research reproducible, manageable, and shareable. 
For the curious, the workshop is still viewable!</li> </ul> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/W4CoptalWw0?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <ul> <li>DVC has a virtual poster at <a href="https://www.scipy2020.scipy.org/" target="_blank" rel="nofollow noopener noreferrer">SciPy 2020</a>! We prepared a demo about <a href="https://dvc.org/blog/scipy-2020-dvc-poster" target="_blank" rel="nofollow noopener noreferrer">packaging models and datasets like software</a> so they can be widely disseminated via GitHub.</li> </ul> <h3 id="good-reads" style="position:relative;">Good reads<a href="#good-reads" aria-label="good reads permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Some excellent reading recommendations from the community:</p> <ul> <li>Data scientist Déborah Mesquita published a thorough guide to using new DVC 1.0 pipelines in a sample ML project. It's truly complete, covering data collection to model evaluation, with detailed code examples. 
If you are new to pipelines, do not miss this!</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/the-ultimate-guide-to-building-maintainable-machine-learning-pipelines-using-dvc-a976907b2a1b" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">The ultimate guide to building maintainable Machine Learning pipelines using DVC</h4> <div class="elp-description">Learn the principles for building maintainable Machine Learning pipelines using DVC</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-07-10/pipes-897b33e8338e2f7b20b08ba2a175c2d7.jpg" alt="The ultimate guide to building maintainable Machine Learning pipelines using DVC"> </div> </a> </section> <p></p> <ul> <li>Caleb Kaiser of <a href="https://github.com/cortexlabs/cortex" target="_blank" rel="nofollow noopener noreferrer">Cortex</a> (another startup in the Runa Capital's Top 20 list!) shared a thinkpiece about challenges from software engineering that can inform production ML. We really agree with what he has to say about reproducibility:</li> </ul> <blockquote> <p>You typically hear about “reproducibility” in reference to ML research, particularly when a paper doesn’t include enough information to recreate the experiment. However, reproducibility also comes up a lot in production ML. Think of it this way — you’re on a team staffed with data scientists and engineers, and you’re all responsible for an image classification API. The data scientists are constantly trying new techniques and architectural tweaks to improve the model’s baseline performance, while at the same time, the model is constantly being retrained on new data. Looking over the APIs performance, you see one moment a week ago where the model’s performance dropped significantly. What caused that drop? 
Without knowing exactly how the model was trained, and on what data, it’s impossible to know for sure.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/what-software-engineers-can-bring-to-machine-learning-25f458c80e5" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">What software engineers can bring to machine learning</h4> <div class="elp-description">Many production machine learning challenges are paralleled in software engineering</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-07-10/tds-fec2edf6c4678dec52338b870f8ce9c6.jpg" alt="What software engineers can bring to machine learning"> </div> </a> </section> <p></p> <ul> <li>Mukul Sood wrote about the Real World, a place beyond Jupyter notebooks where data is non-stationary and servers are unreliable! He covers some very real challenges for taking a data science project into production and introduces the need for CI/CD practices in healthy, scalable ML applications.</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/scaling-machine-learning-in-real-world-cb601b2baf4a" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Scaling Machine Learning in the Real World</h4> <div class="elp-description">Any conversation around scaling or productionizing data science, would need to talk about Continuous Integration/Continuous Deployment.</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-07-10/storm-eda2b016d57c4d44498606e0ce461437.jpg" alt="Scaling Machine Learning in the Real World"> </div> </a> </section> <p></p> <h3 id="a-nice-tweet" style="position:relative;">A nice tweet<a href="#a-nice-tweet" aria-label="a nice tweet permalink" 
class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We'll close on a nice tweet from <a href="https://datasyndrome.com/" target="_blank" rel="nofollow noopener noreferrer">Russell Jurney</a>:</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I have to say I am blown out of the water by <a href="https://twitter.com/DVCorg">@DVCorg</a><br><br>DVC is incredibly powerful. Right now we’re just versioning input/output datasets in DVC against S3, but even this is incredibly useful and so much better than trying Git LFS (ugh) or manual archiving.<a href="https://t.co/5bf5VJuPaE">https://t.co/5bf5VJuPaE</a></p>— Russell Jurney 🇺🇦 (@rjurney) <a href="https://twitter.com/rjurney/status/1266735603921547264">May 30, 2020</a></blockquote> <p>Thanks, we couldn't do it without our community! As always, thanks for joining us and reading. There are lots of ways to stay in touch and we always love to hear from you. Follow us on <a href="https://twitter.com/dvcorg">Twitter</a>, join our <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord server</a>, or leave a blog comment. Until next time! 
😎</p>https://dvc.org/blog/cml-releasehttps://dvc.org/blog/cml-releaseTue, 07 Jul 2020 00:00:00 GMT<h2 id="cicd-for-machine-learning" style="position:relative;">CI/CD for machine learning<a href="#cicd-for-machine-learning" aria-label="cicd for machine learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Today, the DVC team is releasing a new open-source project called Continuous Machine Learning, or CML (<a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">https://cml.dev</a>) to mainstream the best engineering practices of CI/CD to AI and ML teams. CML helps to organize MLOps infrastructure on top of the traditional software engineering stack instead of creating separate AI platforms.</p> <p>Continuous integration and continuous delivery (CI/CD) is a widely-used software engineering practice. It's a validated approach to increasing the agility of software development without sacrificing stability. <strong>But why haven't CI/CD practices taken root in machine learning and data science so far?</strong></p> <p>We see three substantial technical barriers to using standard CI systems with machine learning projects:</p> <ol> <li><strong>Data dependencies.</strong> In ML, data plays a similar role as code: ML results critically depend on datasets, and changes in data need to trigger feedback just like changes in source code. 
Furthermore, multi-GB datasets are challenging to manage with Git-centric CI systems.</li> <li><strong>Metrics-driven.</strong> The traditional software engineering idea of pass/fail tests does not apply in ML. As an example, <code>+0.72% accuracy</code> and <code>-0.35% precision</code> do not answer the question of whether the ML model is good or not. Detailed reports with metrics and plots are needed to make a good/bad model decision.</li> <li><strong>CPU/GPU resources</strong>. ML training often requires more resources than are typically available in CI/CD runners. CI/CD must be connected with cloud computing instances or Kubernetes clusters for ML training.</li> </ol> <h2 id="cicd-for-ml-is-the-next-step-for-the-dvc-team" style="position:relative;">CI/CD for ML is the next step for the DVC team<a href="#cicd-for-ml-is-the-next-step-for-the-dvc-team" aria-label="cicd for ml is the next step for the dvc team permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Since the beginning, our motivation has been helping ML teams benefit from DevOps. We started DVC because we knew that data management would be a crucial bottleneck, and sure enough, DVC was a big step towards making pipelines and experiments manageable and reproducible. But conversations with our community have brought us to one conclusion again and again: CI/CD for ML is the holy grail.</p> <p>Over the last 3 years, we've reached some big milestones:</p> <ol> <li> <p>We built DVC to address the ML data management problem. 
Recently, we <a href="https://dvc.org/blog/dvc-1-0-release" target="_blank" rel="nofollow noopener noreferrer">released DVC 1.0</a>, marking a new and more stable era for our API.</p> </li> <li> <p>DVC has become a core part of many ML teams' daily operations. The latest <a href="https://www.thoughtworks.com/radar/tools" target="_blank" rel="nofollow noopener noreferrer">ThoughtWorks Technology Radar</a> says:</p> <p><em>"… it [DVC] has become a favorite tool for managing experiments in machine learning (ML) projects. Since it's based on Git, DVC is a familiar environment for software developers to bring their engineering practices to ML practice."</em></p> </li> <li> <p>An extraordinary team and community have emerged around DVC:</p> <ul> <li>15 employees in our organization <a href="https://iterative.ai" target="_blank" rel="nofollow noopener noreferrer">https://iterative.ai</a></li> <li>100+ open-source contributors to DVC <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/dvc</a> and another 100+ open-source contributors to docs <a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/dvc.org</a></li> <li>2000+ community members in our Discord <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/chat</a> and GitHub issue tracker <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/dvc</a></li> <li>4000+ regular users of DVC</li> </ul> </li> </ol> <p>Now that DVC is maturing, we're ready to take the next step: we want to revolutionize the ML development processes. We want ML experiments to have greater visibility to teammates, shorter feedback loops, and more reproducibility. We want teams to spend less time managing their computing resources and experiments, and more time building value. 
The goal is to extend the amazing results of DevOps from software development to ML and MLOps.</p> <h2 id="continuous-machine-learning-release" style="position:relative;"><em>Continuous Machine Learning</em> release<a href="#continuous-machine-learning-release" aria-label="continuous machine learning release permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Today, we're releasing an open-source project <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">https://cml.dev</a> to close the gap between machine learning and software development practices.</p> <p>CML is a library of functions used inside CI/CD runners to make ML compatible with <strong>GitHub Actions</strong> and <strong>GitLab CI</strong>. 
We've created functions to:</p> <ol> <li>Generate informative reports on every Pull/Merge Request with metrics, plots, and hyperparameter changes.</li> <li>Provision GPU/CPU resources from cloud service providers (<strong>AWS, GCP, Azure, Ali</strong>) and deploy CI runners using <a href="https://github.com/docker/machine" target="_blank" rel="nofollow noopener noreferrer">Docker Machine</a>.</li> <li>Bring datasets from cloud storage to runners (using <strong>DVC</strong>) for model training, as well as save the resulting model in cloud storage.</li> </ol> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4abfb3a481ef05b3f8a0140ede1bda90/39600/cml-report-metrics.png" alt="Auto-generated metrics-driven report in GitLab Merge Request" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>The workflow and visual reports are customizable by modifying the CI configuration file in your GitHub <code>.github/workflows/*.yaml</code> or GitLab <code>.gitlab-ci.yml</code> project. 
Use CML functions in conjunction with your own ML model training and testing scripts to create your own automated workflow and reporting system.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token comment"># GitLab workflow in '.gitlab-ci.yml' file</span> <span class="token key atrule">stages</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> cml_run <span class="token key atrule">cml</span><span class="token punctuation">:</span> <span class="token key atrule">stage</span><span class="token punctuation">:</span> cml_run <span class="token key atrule">image</span><span class="token punctuation">:</span> iterativeai/cml<span class="token punctuation">:</span>0<span class="token punctuation">-</span>dvc2<span class="token punctuation">-</span>base1 <span class="token key atrule">script</span><span class="token punctuation">:</span> <span class="token punctuation">-</span> dvc pull data <span class="token punctuation">-</span><span class="token punctuation">-</span>run<span class="token punctuation">-</span>cache <span class="token punctuation">-</span> pip install <span class="token punctuation">-</span>r requirements.txt <span class="token punctuation">-</span> dvc repro <span class="token comment"># Compare metrics to master</span> <span class="token punctuation">-</span> git fetch <span class="token punctuation">-</span><span class="token punctuation">-</span>prune <span class="token punctuation">-</span> dvc metrics diff <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>md master <span class="token punctuation">></span><span class="token punctuation">></span> report.md <span class="token comment"># Visualize loss function diff</span> <span class="token punctuation">-</span> dvc plots diff <span class="token punctuation">-</span><span class="token punctuation">-</span>target 
loss.csv <span class="token punctuation">-</span><span class="token punctuation">-</span>show<span class="token punctuation">-</span>vega master <span class="token punctuation">></span> vega.json <span class="token punctuation">-</span> vl2png vega.json <span class="token punctuation">></span> plot.png <span class="token punctuation">-</span> cml publish <span class="token punctuation">-</span><span class="token punctuation">-</span>md plot.png <span class="token punctuation">></span><span class="token punctuation">></span> report.md <span class="token punctuation">-</span> dvc push data <span class="token punctuation">-</span><span class="token punctuation">-</span>run<span class="token punctuation">-</span>cache <span class="token punctuation">-</span> cml send<span class="token punctuation">-</span>comment report.md</code></pre></div> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 614px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/0fd9c743e967356a65ffee9780c681ec/39600/cml-report-params.png" alt="Hyperparameter change with a result image in GitHub Pull request report" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>In this example, all the CML functions are defined in the <strong>docker image</strong> that is used in the workflow - <code>iterativeai/cml:0-dvc2-base1</code>. Users can specify any docker image. 
The only restriction is that the CML library needs to be installed to enable all the CML commands for the reporting and graphs:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">npm</span> i @dvcorg/cml</code></pre></div> <p>Examples of docker images can be found in the <code>docker</code> directory of the <a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">CML repository</a>.</p> <p>As you can see, CML is based on the assumption that MLOps can work with traditional engineering tools. It shouldn't require an entirely separate platform. We're excited about a world where DevOps practitioners can work fluently on both software and ML aspects of a project.</p> <h2 id="the-relationship-between-cml-and-dvc" style="position:relative;">The relationship between CML and DVC<a href="#the-relationship-between-cml-and-dvc" aria-label="the relationship between cml and dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>CML and DVC are related projects under the umbrella of the same team, but will have separate websites and independent development. The CML project is hosted on a new web site: <a href="https://cml.dev" target="_blank" rel="nofollow noopener noreferrer">https://cml.dev</a>. 
The source code and issue tracker are on GitHub: <a href="https://github.com/iterative/cml" target="_blank" rel="nofollow noopener noreferrer">https://github.com/iterative/cml</a></p> <p>For support and communications, the DVC Discord server is still the place to go: <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/chat</a>. We've made a new <code>#cml</code> channel there to discuss CML, CI/CD for ML and other MLOps related questions.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>With the rise of AI/ML teams and ML platforms alongside the software engineering stack, we believe that the industry needs a single technology stack that works for software as well as AI projects. A thin layer of tooling is needed to close the gap between AI projects and software projects so that they fit into the existing stack, and CML is our way of providing it.</p> <p>Our philosophy is that ML projects, and MLOps practices, should be built on top of traditional engineering tools and not as a separate stack. A simple layer of tools will be required to close the gap, and CML is part of this ecosystem. We think this is the future of MLOps.</p> <p>As always, thanks for reading and for being part of the DVC community. We'd love to hear what you think about CML. 
Please be in touch on <a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord</a>!</p>https://dvc.org/blog/june-20-community-gemshttps://dvc.org/blog/june-20-community-gemsMon, 29 Jun 2020 00:00:00 GMT<h2 id="highlights-from-discord" style="position:relative;">Highlights from Discord<a href="#highlights-from-discord" aria-label="highlights from discord permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here are some Q&A's from our Discord channel that we think are worth sharing.</p> <h3 id="q-i-just-upgraded-to-dvc-10-ive-got-some-pipeline-stages-currently-saved-as-dvc-files-is-there-an-easy-way-to-convert-the-old-dvc-format-to-the-new-dvcyaml-standard" style="position:relative;">Q: I just upgraded to DVC 1.0. I've got some pipeline stages currently saved as <code>.dvc</code> files. 
<a href="https://discord.com/channels/485586884165107732/563406153334128681/725019219930120232" target="_blank" rel="nofollow noopener noreferrer">Is there an easy way to convert the old <code>.dvc</code> format to the new <code>dvc.yaml</code> standard?</a><a href="#q-i-just-upgraded-to-dvc-10-ive-got-some-pipeline-stages-currently-saved-as-dvc-files-is-there-an-easy-way-to-convert-the-old-dvc-format-to-the-new-dvcyaml-standard" aria-label="q i just upgraded to dvc 10 ive got some pipeline stages currently saved as dvc files is there an easy way to convert the old dvc format to the new dvcyaml standard permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes! You can easily transfer the stages by hand: <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> is designed for manual edits in any text editor, so you can type your old stages in and then delete the old <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files. 
We also have a <a href="https://gist.github.com/skshetry/07a3e26e6b06783e1ad7a4b6db6479da" target="_blank" rel="nofollow noopener noreferrer">migration script</a> available, although we can't provide long-term support for it.</p> <p>Learn more about the <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> format in our <a href="https://dvc.org/doc/user-guide/dvc-files#dvcyaml-file" target="_blank" rel="nofollow noopener noreferrer">brand new docs</a>!</p> <p><img src="https://media.giphy.com/media/JYpTAnhT0EI2Q/giphy.gif" alt="Year Opening GIF"></p> <p><em>Just like this but with technical documentation.</em></p> <h3 id="q-after-i-pushed-my-local-data-to-remote-storage-i-noticed-the-file-names-are-different-in-my-storage-repository--theyre-hash-values-can-i-make-them-more-meaningful-names" style="position:relative;">Q: After I pushed my local data to remote storage, I noticed the file names are different in my storage repository- they're hash values. 
<a href="https://discord.com/channels/485586884165107732/563406153334128681/717737163122540585" target="_blank" rel="nofollow noopener noreferrer">Can I make them more meaningful names?</a><a href="#q-after-i-pushed-my-local-data-to-remote-storage-i-noticed-the-file-names-are-different-in-my-storage-repository--theyre-hash-values-can-i-make-them-more-meaningful-names" aria-label="q after i pushed my local data to remote storage i noticed the file names are different in my storage repository theyre hash values can i make them more meaningful names permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>No, but for a good reason! What you're seeing are cached files, and they're stored with a special naming convention that makes DVC versioning and addressing possible- these file names are how DVC deduplicates data (to avoid keeping multiple copies of the same file version) and ensures that each unique version of a file is immutable. If you manually overwrote those filenames you would risk breaking Git version control. You can <a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">read more about how DVC uses this file format in our docs</a>.</p> <p>It sounds like you're looking for ways to interact with DVC-tracked objects at a high level of abstraction, meaning that you want to interface with the original filenames and not the machine-generated hashes used by DVC. 
There are a few secure and recommended ways to do this:</p> <ul> <li>If you want to see a human-readable list of files that are currently tracked by DVC, try the <a href="https://dvc.org/doc/command-reference/list"><code>dvc list</code></a> command-<a href="https://dvc.org/doc/command-reference/list" target="_blank" rel="nofollow noopener noreferrer">read up on it here</a>.</li> <li>Check out our <a href="https://dvc.org/doc/use-cases/data-registries#data-registries" target="_blank" rel="nofollow noopener noreferrer">data registry tutorial</a> to see how the commands <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> and <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> are used to download and share DVC-tracked artifacts. The syntax is built for an experience like using a package manager.</li> <li>The <a href="https://dvc.org/doc/api-reference" target="_blank" rel="nofollow noopener noreferrer">DVC Python API</a> gives you programmatic access to DVC-tracked artifacts, using human-readable filenames.</li> </ul> <h3 id="q-is-it-better-practice-to-dvc-add-data-files-individually-or-to-add-a-directory-containing-multiple-data-files" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/722141190312689675" target="_blank" rel="nofollow noopener noreferrer">Is it better practice to <code>dvc add</code> data files individually, or to add a directory containing multiple data files?</a><a href="#q-is-it-better-practice-to-dvc-add-data-files-individually-or-to-add-a-directory-containing-multiple-data-files" aria-label="q is it better practice to dvc add data files individually or to add a directory containing multiple data files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 
4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If the directory you're adding is logically one unit (for example, it is the whole dataset in your project), we recommend using <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> at the directory level. Otherwise, add files one-by-one. You can <a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">read more about how DVC versions directories in our docs</a>.</p> <h3 id="q-do-you-have-any-examples-of-using-dvc-with-minio" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/722780202844815362" target="_blank" rel="nofollow noopener noreferrer">Do you have any examples of using DVC with MinIO?</a><a href="#q-do-you-have-any-examples-of-using-dvc-with-minio" aria-label="q do you have any examples of using dvc with minio permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We don't have any tutorials for this use case exactly, but it's a very straightforward modification from <a href="https://dvc.org/doc/use-cases" target="_blank" rel="nofollow noopener noreferrer">our basic use cases</a>. 
The key difference when using MinIO or a similar S3-compatible storage (like DigitalOcean Spaces or IBM Cloud Object Storage) is that in addition to setting remote data storage, you must set the <code>endpointurl</code> too. For example:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> myremote s3://mybucket/path/to/dir </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote endpointurl https://object-storage.example.com</span></code></pre></div> <p>Read up on configuring supported storage <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">in our docs</a>.</p> <h3 id="q-if-i-have-a-folder-containing-many-data-files-is-there-any-advantage-to-zipping-the-folder-and-dvc-tracking-the-zip" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/714922184455225445" target="_blank" rel="nofollow noopener noreferrer">If I have a folder containing many data files, is there any advantage to zipping the folder and DVC tracking the <code>.zip</code>?</a><a href="#q-if-i-have-a-folder-containing-many-data-files-is-there-any-advantage-to-zipping-the-folder-and-dvc-tracking-the-zip" aria-label="q if i have a folder containing many data files is there any advantage to zipping the folder and dvc tracking the zip permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 
1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are a few things to consider:</p> <ul> <li> <p><strong>CPU time.</strong> Even though it can be faster to pull a single file than a directory (though not in all cases, since we can parallelize directory downloads), the tradeoff is the time needed to unzip your data. Depending on your constraints, this can be expensive and undesirable.</p> </li> <li> <p><strong>Deduplication.</strong> DVC deduplicates on the file level. So if you add one single file to a directory, DVC will save only that file, not the whole dataset again. If you use a zipped directory you won't get this benefit. In the long run, this could be more expensive in terms of storage space for your DVC cache and remote if the contents of your dataset change frequently.</p> </li> </ul> <p>Generally, we would recommend first trying a plain unzipped directory. DVC is designed to work with large numbers of files (on the order of millions), and the latest release (DVC 1.0) has <a href="https://dvc.org/blog/dvc-1-0-release#data-transfer-optimizations" target="_blank" rel="nofollow noopener noreferrer">optimizations built for this purpose exactly</a>.</p> <h3 id="q-can-i-execute-a-dvc-push-with-the-dvc-python-api-inside-a-python-script" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/718419219288686664" target="_blank" rel="nofollow noopener noreferrer">Q: Can I execute a <code>dvc push</code> with the DVC Python API inside a Python script?</a><a href="#q-can-i-execute-a-dvc-push-with-the-dvc-python-api-inside-a-python-script" aria-label="q can i execute a dvc push with the dvc python api inside a python script permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 
5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Currently, our <a href="https://dvc.org/doc/api-reference#python-api" target="_blank" rel="nofollow noopener noreferrer">Python API</a> doesn't support commands like <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>, or <a href="https://dvc.org/doc/command-reference/status"><code>dvc status</code></a>. It is designed for interfacing with objects tracked by DVC. That said, CLI commands are basically calling <code>dvc.repo.Repo</code> object methods. So if you want to use commands from within Python code, you could try creating a <code>Repo</code> object with <code>r = Repo({root_dir})</code> and then <code>r.push()</code>. Please note that we don't officially support this use case yet.</p> <p>Of course, you can also run DVC commands from a Python script using <code>subprocess</code> or a similar library for issuing system commands.</p> <h3 id="q-does-the-dvc-pipeline-command-for-visualizing-pipelines-still-work-in-dvc-10" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/717682556203565127" target="_blank" rel="nofollow noopener noreferrer">Q: Does the <code>dvc pipeline</code> command for visualizing pipelines still work in DVC 1.0?</a><a href="#q-does-the-dvc-pipeline-command-for-visualizing-pipelines-still-work-in-dvc-10" aria-label="q does the dvc pipeline command for visualizing pipelines still work in dvc 10 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 
9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Most of the <code>dvc pipeline</code> functionality- like <code>dvc pipeline show --ascii</code> to print out an ASCII diagram of your pipeline- has been migrated to a new command, <a href="https://dvc.org/doc/command-reference/dag"><code>dvc dag</code></a>. This function is written for our new pipeline format. Check out <a href="https://dvc.org/doc/command-reference/dag#dag" target="_blank" rel="nofollow noopener noreferrer">our new docs</a> for an example.</p> <h3 id="q-is-there-a-way-to-create-a-dvc-pipeline-stage-without-running-the-commands-in-that-stage" style="position:relative;"><a href="https://discord.com/channels/485586884165107732/485596304961962003/715271980978405447" target="_blank" rel="nofollow noopener noreferrer">Q: Is there a way to create a DVC pipeline stage without running the commands in that stage?</a><a href="#q-is-there-a-way-to-create-a-dvc-pipeline-stage-without-running-the-commands-in-that-stage" aria-label="q is there a way to create a dvc pipeline stage without running the commands in that stage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes. Say you have a Python script, <code>train.py</code>, that takes in a dataset <code>data</code> and outputs a model <code>model.pkl</code>. 
To create a DVC pipeline stage corresponding to this process, you could do so like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-n</span> train </span> -d train.py -d data -o model.pkl python train.py</code></pre></div> <p>However, this would automatically rerun the command <code>python train.py</code>, which is not necessarily desirable if you have recently run it, the process is time consuming, and the dependencies and outputs haven't changed. You can use the <code>--no-exec</code> flag to get around this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">--no-exec</span> </span> -n train -d train.py -d data -o model.pkl python train.py</code></pre></div> <p>This flag can also be useful when you want to define the pipeline on your local machine but plan to run it later on a different machine (perhaps an instance in the cloud). <a href="https://dvc.org/doc/command-reference/run" target="_blank" rel="nofollow noopener noreferrer">Read more about the <code>--no-exec</code> flag in our docs.</a></p> <p>One other approach worth mentioning is that you can manually edit your <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file to add a stage. 
If you add a stage this way, pipeline commands won't be executed until you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>.</p>https://dvc.org/blog/scipy-2020-dvc-posterhttps://dvc.org/blog/scipy-2020-dvc-posterFri, 26 Jun 2020 00:00:00 GMT<p>When I was doing my Ph.D., every time I published a paper I shared a public GitHub repository with my dataset and scripts to reproduce my statistical analyses. While it took a bit of work to get the repository in good shape for sharing (cleaning up code, adding documentation), the process was straightforward: upload everything to the repo!</p> <p>But when I started working on deep learning projects, things got considerably more complicated. For example, in a <a href="https://pudding.cool/2019/11/big-hair/" target="_blank" rel="nofollow noopener noreferrer">data journalism project I did with The Pudding</a>, I wanted to understand how hair style (particularly size!) changed over the years. There were a lot of moving parts:</p> <ul> <li>A public dataset of yearbook photos released and maintained by <a href="https://people.eecs.berkeley.edu/~shiry/projects/yearbooks/yearbooks.html" target="_blank" rel="nofollow noopener noreferrer">Ginosar et al.</a></li> <li>A deep learning model I trained to segment the hair in yearbook photos</li> <li>A derivative dataset of "hair maps" for each photo in the original dataset</li> <li>All the code to train the deep learning model and analyse the derivative dataset</li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b058a7be71c126ec336b730fa3dc7718/39600/hairflow.png" alt="hairflow" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>The parts of my big-hair-data project: an original public dataset, a model for segmenting the images, a derivative 
dataset of segment maps, and analysis scripts.</em></p> <p>How would you share this with a collaborator, or open it up to the public? Throwing it all in a GitHub repository was not an option. My model wouldn't fit on GitHub because it was over the 100 MB size limit. I also wanted to preserve a clear link between my derived dataset and the original- it should be obvious exactly how I got the public dataset. And if that public dataset were to ever change, I would ideally want it to be clear what version I used for my analyses.</p> <p>This blog post is about several different ways of "releasing" data science projects, with an emphasis on preserving meaningful links about the origins of derived data and models. I'm not making any strong assumptions about whether project materials are released within an organization (only to teammates, for example) or to the whole internet.</p> <p>Let's look at a few methods.</p> <h1 id="method-one-artifacts-in-the-cloud" style="position:relative;">Method One: artifacts in the cloud<a href="#method-one-artifacts-in-the-cloud" aria-label="method one artifacts in the cloud permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>When you work with big models and datasets, you often can't host them in a GitHub repo. But you can put them in cloud storage, and then provide a script in your GitHub repo to download them. 
For example, in the fantastic <code>gpt-2-simple</code> <a href="https://github.com/minimaxir/gpt-2-simple" target="_blank" rel="nofollow noopener noreferrer">project by Max Woolf</a>, Max stores huge GPT-2 models in Google Drive and provides a script to download a specified model to a user's local workspace if it isn't already there.</p> <p>Likewise, the <a href="https://github.com/NVlabs/stylegan" target="_blank" rel="nofollow noopener noreferrer">Nvidia StyleGAN release</a> provides a hardcoded URL to their model in Google Drive storage. Both the <code>gpt-2-simple</code> and StyleGAN projects have custom scripts to handle these big downloads, and largely thanks to the work of the project maintainers, users only interact with the downloading process at a very high level.</p> <p>Considering some pros and cons of this approach:</p> <table><thead><tr><th align="center"><strong>Pros</strong></th><th align="center"><strong>Cons</strong></th></tr></thead><tbody><tr><td align="center">It's easy to put a model in a bucket</td><td align="center">Hardcoded links are brittle</td></tr><tr><td align="center">Works for pip packages</td><td align="center">Need to write custom functions</td></tr><tr><td align="center">No extra tools, just Python scripting</td><td align="center">Downloads aren't versioned</td></tr></tbody></table> <h1 id="method-two-hubs-catalogs--zoos" style="position:relative;">Method Two: Hubs, Catalogs & Zoos<a href="#method-two-hubs-catalogs--zoos" aria-label="method two hubs catalogs zoos permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>There 
are a (growing) number of websites willing to long-term host big models and datasets, plus relevant meta-data, code, and publications. Some even allow you to upload several versions of a project- it's not Git, for sure, but even basic version control is something.</p> <p>For example, <a href="https://pytorch.org/hub/" target="_blank" rel="nofollow noopener noreferrer">PyTorch Hub</a> lets researchers publish trained models developed in the PyTorch framework, along with code and papers. It's easily searched and browsed, which makes projects discoverable.</p> <p>For a dataset analog, Kaggle is similar- they host user-submitted datasets and help other users find them. Both PyTorch Hub and Kaggle have APIs for programmatically downloading artifacts.</p> <table><thead><tr><th align="center"><strong>Pros</strong></th><th align="center"><strong>Cons</strong></th></tr></thead><tbody><tr><td align="center">Browsable & discoverable</td><td align="center">Centrally managed</td></tr><tr><td align="center">Public</td><td align="center">Public (no granularity)</td></tr><tr><td align="center">Good with big models</td><td align="center">Weak versioning support</td></tr></tbody></table> <h1 id="method-three-packaging-with-dvc" style="position:relative;">Method Three: Packaging with DVC<a href="#method-three-packaging-with-dvc" aria-label="method three packaging with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, or Data Version Control, is a Python project for extending Git 
version control to large project artifacts like datasets and models. It's not a replacement for Git- DVC works <em>with</em> Git!</p> <p>The basic idea is that your datasets and models are stored in a DVC remote, which can be any cloud storage or server of your choice. DVC creates metadata about file versions that can be tracked by Git and hosted on GitHub- so you can share your datasets and models like any GitHub project, with all the benefits of versioning. Let's look at a case study.</p> <h2 id="creating-a-dvc-project" style="position:relative;">Creating a DVC project<a href="#creating-a-dvc-project" aria-label="creating a dvc project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Say I have a project containing a dataset, model training code, and a model.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">ls</span> </span>data.csv train.py model.pkl</code></pre></div> <p>Say our model and dataset are large and we want to track them with DVC. For remote storage, we want to use a personal S3 bucket. 
We would run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git init</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> myremote s3://mybucket/myproject </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc add</span> data.csv model.pkl </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div> <p>With these commands, I've initialized Git and DVC tracking. Next, I've set my S3 bucket as the default DVC remote (the <code>-d</code> flag makes it the default, so <code>dvc push</code> knows where to upload). Then I've added <code>data.csv</code> and <code>model.pkl</code> to DVC tracking. Finally, when I run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>, the model and dataset are pushed to the S3 bucket. On my local machine, two metafiles are created: <code>data.csv.dvc</code> and <code>model.pkl.dvc</code>. 
These can be tracked with Git!</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">ls</span> </span>data.csv.dvc train.py model.pkl.dvc</code></pre></div> <p>So after setting a remote Git repository, <code>git add</code>, <code>commit</code> and <code>push</code> as usual (assuming you are a regular Git user, that is):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git remote add</span> origin [email protected]:elle/myproject </span><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span> <span class="token operator">&&</span> <span class="token function">git</span> commit <span class="token parameter variable">-m</span> <span class="token string">"first commit"</span> </span><span class="token line"><span class="token input">$ </span><span class="token git">git push</span> origin master</span></code></pre></div> <h2 id="package-management-with-dvc" style="position:relative;">Package management with DVC<a href="#package-management-with-dvc" aria-label="package management with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Now let's say one of my teammates wants to access my work so far- specifically, they want to see if another method for constructing features from raw data 
will help model accuracy. I've given them permission to access my GitHub repository. On their local machine, they'll run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://github.com/elle/myproject data.csv </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://github.com/elle/myproject model.pkl</span></code></pre></div> <p>These commands download the latest version of the <code>data.csv</code> and <code>model.pkl</code> artifacts to their local machine (<code>dvc import</code> takes one path per call), as well as the DVC metafiles <code>data.csv.dvc</code> and <code>model.pkl.dvc</code> indicating the precise version and source.</p> <p>Collaborators can also download artifacts from previous versions, releases, or parallel feature branches of a project. For example, if I released a new version of my project with a Git tag (say <code>v.2.0.1</code>), collaborators can run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token parameter variable">--rev</span> v.2.0.1 https://github.com/elle/myproject data.csv</span></code></pre></div> <p>Lastly, because <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> maintains a link between the downloaded artifacts and my repository, collaborators can check for project updates by passing the import metafiles to:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc update</span> data.csv.dvc model.pkl.dvc</span></code></pre></div> <p>If new versions are detected, DVC automatically syncs the local workspace with those versions.</p> <h2 id="when-should-you-do-this" style="position:relative;">When should you do this?<a href="#when-should-you-do-this" 
aria-label="when should you do this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In my own experience releasing a large public dataset with DVC, I've seen several benefits:</p> <ul> <li>Within an hour, someone found data points I'd been missing. It was straightforward to make a new release after patching this error.</li> <li>Several people modeled my dataset! Highly rewarding.</li> <li>Since GitHub is a widely used platform for code sharing, it's a natural fit for open source scientific projects and has little overhead for potential collaborators</li> </ul> <p>To return to the pros and cons table:</p> <table><thead><tr><th align="center"><strong>Pros</strong></th><th align="center"><strong>Cons</strong></th></tr></thead><tbody><tr><td align="center">Git version your dataset</td><td align="center">No GUI access to files in DVC remote</td></tr><tr><td align="center">Granular sharing permissions</td><td align="center">Collaborators need to use DVC</td></tr><tr><td align="center">DVC abstracts away download scripts/hardcoded URLs</td><td align="center">Can be serverless, but you need to manage cloud storage</td></tr></tbody></table> <h1 id="the-bottom-line" style="position:relative;">The bottom line<a href="#the-bottom-line" aria-label="the bottom line permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 
12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h1> <p>Packaging models and datasets is a non-trivial part of the machine learning workflow. DVC provides a method for giving users a Git-centric experience of cloning or forking these artifacts, with an emphasis on <em>versioning artifacts</em> and <em>abstracting away the processes of uploading, downloading, and storing artifacts</em>. For projects with high complexity- like my hair project, which had some gnarly dependencies and big artifacts- this kind of source control pays off. If you don't know where your data came from or how it's been transformed, it's impossible to be scientific.</p> <p>Thanks for stopping by our virtual poster! I'm happy to take questions or comments about how version control fits into the scientific workflow. Leave a comment, reach out on Twitter, or send an email.</p> <h2 id="further-reading" style="position:relative;">Further reading<a href="#further-reading" aria-label="further reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><em>Check out our <a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">tutorial about creating a data registry</a> for more code examples.</em></p>https://dvc.org/blog/dvc-1-0-releasehttps://dvc.org/blog/dvc-1-0-releaseMon, 22 Jun 2020 00:00:00 GMT<h2 id="introduction" style="position:relative;">Introduction<a href="#introduction" aria-label="introduction permalink" class="anchor 
after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Three years ago, I was concerned about good engineering standards in data science: data versioning, reproducibility, and workflow automation, like continuous integration and continuous delivery (CI/CD), but for machine learning. I wanted there to be a "Git for data" to make all this possible. So I created DVC (Data Version Control), which works as version control for data projects.</p> <p>Technically, DVC codifies your data and machine learning pipelines as text metafiles (with pointers to actual data in S3/GCP/Azure/SSH), while you use Git for the actual versioning. DevOps folks call this approach GitOps or, more specifically, in this case <em>DataOps</em> or <em>MLOps</em>.</p> <p>DVC 1.0 is inspired by discussions and contributions from our community of data scientists, ML engineers, developers and software engineers.</p> <h2 id="dvc-10" style="position:relative;">DVC 1.0<a href="#dvc-10" aria-label="dvc 10 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The new DVC 1.0 is inspired by discussions and contributions from our community: both fresh ideas and bug reports 😅. 
All these contributions, big and small, have a collective impact on DVC's development. I'm confident 1.0 wouldn't be possible without our community. They tell us what features matter most, which approaches work (and which don't!), and what they need from DVC to support their ML projects.</p> <p>A few weeks ago we announced the 1.0 pre-release. After lots of helpful feedback from brave users, it's time to go live. Now, DVC 1.0 is available with all the standard installation methods including <code>pip</code>, <code>conda</code>, <code>brew</code>, <code>choco</code>, and system-specific packages: deb, rpm, msi, pkg. See <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/doc/install</a> for more details.</p> <h2 id="new-features" style="position:relative;">New features<a href="#new-features" aria-label="new features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>It took us 3 years to finalize the requirements for DVC 1.0 and stabilize the commands (API) and DVC file formats. 
Below are the major lessons that we have learned in 3 years of this journey and how these are reflected in the new DVC.</p> <h3 id="multi-stage-dvc-files" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/1871" target="_blank" rel="nofollow noopener noreferrer">Multi-stage DVC files</a><a href="#multi-stage-dvc-files" aria-label="multi stage dvc files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our users taught us that ML pipelines evolve much faster than data engineering pipelines with data processing steps. People need to change the commands of the pipeline often and it was not easy to do this with the old DVC-files.</p> <p>In DVC 1.0, the DVC metafile format was changed in three big ways. First, instead of multiple DVC "stage files" (<code>*.dvc</code>), each project has a single <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> file. By default, all stages go in this single YAML file.</p> <p>Second, we made clear connections between the <code>dvc run</code> command (a helper to define pipeline stages), and how stages are defined in <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>. Many of the options of <code>dvc run</code> are mirrored in the metafile. 
We wanted to make it far less complicated to edit an existing pipeline by making <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> more human-readable and writable.</p> <p>Third, file and directory hash values are no longer stored in the pipeline metafile. This approach aligns better with the GitOps paradigm and simplifies the usage of DVC by making the metafile much easier for humans to read:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">process</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> ./process_raw_data raw_data.log users.csv
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> raw_data.log
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> process_file
      <span class="token punctuation">-</span> click_threshold
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> users.csv
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> users.csv
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> epochs
      <span class="token punctuation">-</span> log_file
      <span class="token punctuation">-</span> dropout
    <span class="token key atrule">metrics</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> logs.csv
      <span class="token punctuation">-</span> <span class="token key atrule">summary.json</span><span class="token punctuation">:</span>
          <span class="token key atrule">cache</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> model.pkl</code></pre></div> <p>All of the hashes have been moved to a special file, <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a>, which is a lot like the old DVC-file format. DVC uses this lock file to determine which data files need to be restored to the workspace from data remotes (cloud storage) and whether a particular pipeline stage needs to be rerun. In other words, we're separating the human-readable parts of the pipeline into <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, and the auto-generated "machine" parts into <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a>.</p> <p>Another cool change: the auto-generated part (<a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a>) doesn't necessarily have to be stored in your Git repository. The new run-cache feature removes the need to store the lock file in Git repositories. 
That brings us to our next big feature:</p> <h3 id="run-cache" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/1234" target="_blank" rel="nofollow noopener noreferrer">Run cache</a><a href="#run-cache" aria-label="run cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We built DVC with a workflow in mind: one experiment to one commit. Some users love it, but this approach gets clunky fast for others (like folks who are grid-searching a hyperparameter space). Making a Git commit for each ML experiment was a requirement with the old DVC if you wanted to snapshot your project or pipelines on each experiment. Moving forward, we want to give users more flexibility to decide how often they want to commit.</p> <p>We had an insight that data remotes (S3, Azure Blob, SSH, etc.) can be used instead of Git for storing the codified meta information, not only data. DVC 1.0 implements a special structure, the run-cache, that preserves the state (including all the hashes). Basically, all the information that is stored in the new <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files#dvclock-file"><code>dvc.lock</code></a> file is replicated in the run-cache.</p> <p>The advantage of the run-cache is that pipeline runs (and output file versions) are not directly connected to Git commits anymore. The new DVC can store all the runs in the run-cache, even if they were never committed to Git.</p> <p>This approach gives DVC a "long memory" of stage runs. 
If a user tries to run a stage that was previously run (whether committed to Git or not), then DVC can return the result from the run-cache without rerunning it. This is useful for a hyperparameter optimization stage, when users return to a previous set of parameters and don't want to wait for ML retraining.</p> <p>Another benefit of the run-cache is related to CI/CD systems for ML, which is a holy grail of MLOps. The long memory means users don't have to make auto-commits on the CI/CD side - see <a href="https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments" target="_blank" rel="nofollow noopener noreferrer">this Stack Overflow question</a>.</p> <h3 id="plots" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3409" target="_blank" rel="nofollow noopener noreferrer">Plots</a><a href="#plots" aria-label="plots permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Countless users have asked when we'd support metrics visualizations. It became clear that metrics and their visualization are an essential part of <em>DataOps</em>, especially when it comes to navigating ML models, datasets and experiments. Now it's here: DVC 1.0 introduces metrics file visualization commands, <a href="https://dvc.org/doc/command-reference/plots/diff"><code>dvc plots diff</code></a> and <a href="https://dvc.org/doc/command-reference/plots/show"><code>dvc plots show</code></a>. 
This is brand-new functionality in DVC and it's <em>in experimental mode</em> now.</p> <p>This function is designed not only for visualizing the current state of your project, but also for comparing plots across your Git history. Users can visualize how, for example, their model accuracy in the latest commit differs from another commit (or even multiple commits).</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">-d</span> logs.csv HEAD HEAD^ d1e4d848 baseline_march </span>file:///Users/dmitry/src/plot/logs.csv.html <span class="token line"><span class="token input">$ </span><span class="token command">open</span> logs.csv.html</span></code></pre></div> <p><img src="https://dvc.org/2020-05-04/dvc-plots-092248e6898ab510fc3803efb5e22d9f.svg" alt=""></p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">-d</span> logs.csv HEAD HEAD^ d1e4d848 baseline_march <span class="token punctuation">\</span> <span class="token parameter variable">-x</span> loss <span class="token parameter variable">--template</span> scatter </span>file:///Users/dmitry/src/plot/logs.csv.html <span class="token line"><span class="token input">$ </span><span class="token command">open</span> logs.csv.html</span></code></pre></div> <p><img src="https://dvc.org/2020-05-04/dvc-plots-scatter-9cfc6c2078273faa482129d8d1609967.svg" alt=""></p> <p>DVC plots are powered by the <a href="https://vega.github.io/vega-lite/" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite graphic library</a>. 
We picked Vega because it's high-level to manipulate, compatible with all ML frameworks, and looks great out of the box. However, you don't have to know Vega to use DVC plots: we've provided default templates for line graphs, scatterplots, and confusion matrices, so you can just point DVC plots to your metrics and go.</p> <h3 id="data-transfer-optimizations" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3488" target="_blank" rel="nofollow noopener noreferrer">Data transfer optimizations</a><a href="#data-transfer-optimizations" aria-label="data transfer optimizations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In <em>DataOps</em>, data transfer speed is hugely important. We've done substantial work to optimize data management commands, like <a href="https://dvc.org/doc/command-reference/pull#-c"><code>dvc pull / push / status -c / gc -c</code></a>. Now, based on the amount of data to move, DVC can choose the optimal strategy for traversing your data remote.</p> <p><a href="https://github.com/iterative/dvc/issues/2147" target="_blank" rel="nofollow noopener noreferrer">Mini-indexes</a> help DVC instantly check data directories instead of iterating over millions of files. This also speeds up adding/removing files to/from large directories.</p> <p>More optimizations are included in the release based on our profiling of performance bottlenecks. 
More detailed <a href="https://gist.github.com/pmrowla/338d9645bd05df966f8aba8366cab308" target="_blank" rel="nofollow noopener noreferrer">benchmark reports</a> show how many seconds it takes to run specific commands on a directory containing 2 million images.</p> <p><img src="https://dvc.org/2020-05-04/benchmarks-fb3909a1a199bbfdfb5b66b689e2ffb0.svg" alt=""></p> <h3 id="hyperparameter-tracking" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3393" target="_blank" rel="nofollow noopener noreferrer">Hyperparameter tracking</a><a href="#hyperparameter-tracking" aria-label="hyperparameter tracking permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This feature was actually released in the last DVC version, 0.93 (see the <a href="https://dvc.org/doc/command-reference/params" target="_blank" rel="nofollow noopener noreferrer">params docs</a>). However, it is an important step to support configuration files and ML experiments in a more holistic way.</p> <p>The parameters are a special type of dependency in the pipelines. 
This is the way of telling DVC that a command depends not on a file (<code>params.yaml</code>) but on a particular set of values in the file:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-d</span> users.csv <span class="token parameter variable">-o</span> model.pkl <span class="token punctuation">\</span> <span class="token parameter variable">--params</span> lr,train.epochs,train.layers <span class="token punctuation">\</span> python train.py</span></code></pre></div> <p>The <code>params.yaml</code> file is the place where the parameters are stored:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">lr</span><span class="token punctuation">:</span> <span class="token number">0.0041</span> <span class="token key atrule">train</span><span class="token punctuation">:</span> <span class="token key atrule">epochs</span><span class="token punctuation">:</span> <span class="token number">70</span> <span class="token key atrule">layers</span><span class="token punctuation">:</span> <span class="token number">9</span> <span class="token key atrule">process</span><span class="token punctuation">:</span> <span class="token key atrule">thresh</span><span class="token punctuation">:</span> <span class="token number">0.98</span> <span class="token key atrule">bow</span><span class="token punctuation">:</span> <span class="token number">15000</span></code></pre></div> <h3 id="stable-releases-cycles" style="position:relative;">Stable releases cycles<a href="#stable-releases-cycles" aria-label="stable releases cycles permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 
2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Today, many teams use DVC in their daily job for modeling and as part of their production MLOps automation systems. Stability plays an increasingly important role.</p> <p>We've always prioritized agility and speed in our development process. There have been weeks with two DVC releases! This approach had a ton of benefits in terms of learning speed and rapid feedback from users.</p> <p>Now we're seeing signs that it's time to shift gears. Our API is stabilized and version 1.0 is built with our long-term vision in mind. Our user-base has grown and we have footing with mature teams - teams that are using DVC in mission-critical systems. That's why we're intentionally going to spend more time on release testing in the future. 
It might increase the time between releases, but the quality of the tool will be more predictable.</p> <p>Additionally, we've already implemented a benchmark testing framework to track performance across potential releases: <a href="https://iterative.github.io/dvc-bench/" target="_blank" rel="nofollow noopener noreferrer">https://iterative.github.io/dvc-bench/</a>. On this website, anyone can see the performance improvements and degradations for every release candidate:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 605px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/44a8b224d00774bd0be1a55b2c98ad45/39600/dvc-benchmark.png" alt="dvc benchmark" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="for-more-information-on-the-new-features" style="position:relative;">For more information on the new features…<a href="#for-more-information-on-the-new-features" aria-label="for more information on the new features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Each of these new features has a story that could fill a separate blog post - so that's what we'll be doing. We'll be posting more soon. 
<a href="https://github.com/pmrowla" target="_blank" rel="nofollow noopener noreferrer">Peter Rowlands</a> will be writing a blog post about the performance optimization in DVC 1.0, <a href="https://github.com/pared" target="_blank" rel="nofollow noopener noreferrer">Paweł Redzyński</a> about versioning and visualizing plots, <a href="https://github.com/skshetry" target="_blank" rel="nofollow noopener noreferrer">Saugat Pachhai</a> about the new DVC file formats and pipelines, and <a href="https://github.com/efiop" target="_blank" rel="nofollow noopener noreferrer">Ruslan Kuprieiev</a> about run-cache.</p> <p>Please stay in touch and subscribe to our newsletter at <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">http://dvc.org</a>.</p> <h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>It's quite a journey to build an open source project in the ML/AI space. We're fortunate to have a community of DVC users, contributors and cheerleaders. All these folks help us tremendously to define, test and develop the project. We've reached this significant milestone of version 1.0 together and I hope we'll continue working on DVC and bringing the best practices of DataOps and MLOps to the ML/AI space.</p> <p>Thank you again! 
And please be in touch on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a>, and our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>.</p>https://dvc.org/blog/june-20-dvc-heartbeathttps://dvc.org/blog/june-20-dvc-heartbeatMon, 08 Jun 2020 00:00:00 GMT<p>Welcome to the June Heartbeat, our monthly roundup of cool happenings, <a href="#from-the-community">good reads</a> and <a href="#coming-up-soon">up-and-coming developments</a> in the DVC community.</p> <h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In the beginning of May, we <a href="https://dvc.org/blog/dvc-3-years-and-1-0-release" target="_blank" rel="nofollow noopener noreferrer">pre-released DVC 1.0</a>. Ever since, we've been putting the final touches on 1.0- wrapping up features, fixing bugs 🐛, and responding to feedback from intrepid users <a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">trying the pre-release</a>. To recap, here are some of the big features coming:</p> <ul> <li> <p><strong>Plots powered by Vega-Lite</strong> We're building <a href="https://dvc.org/doc/command-reference/plots#plots" target="_blank" rel="nofollow noopener noreferrer">functions for visualizing metrics</a> in your project, as well as comparing metrics across commits. 
We chose <a href="https://github.com/vega/vega-lite" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite plots</a> because they're high-level, compatible with ML projects written in any language, and beautiful by default.</p> </li> <li> <p><strong>Human readable and writeable pipelines.</strong> We're reworking pipelines so you can modify dependencies, outputs, metrics, plots, and entire stages easily: via manual edits to a <code>.yaml</code> pipeline file. This redesign will consolidate pipeline <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files into a single file (yay, simpler working directory). No worries for pipeline enthusiasts- DVC 1.0 is backwards compatible, so your existing projects won't be interrupted.</p> </li> <li> <p><strong>Run cache.</strong> One of the most exciting features is the run-cache, a local record of pipeline versions that have previously been run and the outputs of those runs. It can seriously cut down on compute time if you find yourself repeating pipeline executions. For our CI/CD users, it also offers a way to save the output of your pipeline- like models or results- <a href="https://stackoverflow.com/questions/61245284/will-you-automate-git-commit-into-ci-cd-pipline-to-save-dvc-run-experiments" target="_blank" rel="nofollow noopener noreferrer">without auto-commits</a>.</p> </li> </ul> <p>DVC 1.0 work has been our top priority this past month, and we are <em>extremely close</em> to the release. Think 1-2 weeks!</p> <p>Another neat announcement: DVC moved up on <a href="https://www.thoughtworks.com/radar/tools" target="_blank" rel="nofollow noopener noreferrer">ThoughtWorks Technology Radar</a>! To quote ThoughtWorks:</p> <blockquote> <p>In 2018 we mentioned DVC in conjunction with the versioning data for reproducible analytics. Since then it has become a favorite tool for managing experiments in machine learning (ML) projects. 
Since it's based on Git, DVC is a familiar environment for software developers to bring their engineering practices to ML practice. Because it versions the code that processes data along with the data itself and tracks stages in a pipeline, it helps bring order to the modeling activities without interrupting the analysts’ flow.</p> </blockquote> <p>And here we are on the radar, in the Trial zone:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 377px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e25ad107f180331a3e3ddca2064d16d5/39600/radar.png" alt="radar" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Blip, blip, blip!</em></p> <p>We are honored. In fact, this was validating in several ways. We field a lot of questions about our decision to build around Git, rather than creating a platform. It's awesome to know our approach is resonating with teams at the intersection of ML and software development. Thanks, ThoughtWorks!</p> <p>Last up in company news: you might recall that in early May, we hosted an online meetup. <a href="http://mribeirodantas.me" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a> hosted guest talks from <a href="https://github.com/ehutt" target="_blank" rel="nofollow noopener noreferrer">Elizabeth Hutton</a> and <a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a>- we heard about constructing a new COVID-19 dataset, using DVC with transformer language models, and building custom cloud infrastructure for MLOps. There's also Q&A with the DVC team, where we fielded audience questions. 
A video of the meetup is available now, so check it out if you missed the event.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/19GMtrFykSU?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As usual, there's a ton of noteworthy action in the DVC community.</p> <p><a href="https://twitter.com/dhaynes23" target="_blank" rel="nofollow noopener noreferrer">Derek Haynes</a>, MLOps expert and new <a href="https://dvc.org/blog/dvc-ambassador-program-announcement" target="_blank" rel="nofollow noopener noreferrer">DVC Ambassador</a>, wrote an excellent overview of using <a href="https://github.com/features/codespaces/" target="_blank" rel="nofollow noopener noreferrer">GitHub Codespaces</a>. Codespaces is a new development environment (currently in beta) that we're eagerly watching. 
As Derek shows in his blog, it lets you have a Jupyter Notebook experience without sacrificing on development standards- he uses <a href="https://docs.whisk-ml.org/en/latest/" target="_blank" rel="nofollow noopener noreferrer">whisk</a> to structure the project and manage Python package dependencies, and DVC to version the model training pipeline.</p> <p>This use case is telling in the <a href="https://towardsdatascience.com/the-case-against-the-jupyter-notebook-d4da17e97243" target="_blank" rel="nofollow noopener noreferrer">battle over Jupyter notebooks</a>: we might just be able to have both a notebook <em>and</em> mature project management. Give Derek's blog a read and tell us what you think.</p> <p> </p><section class="elp-content-holder"> <a href="https://dlite.cc/2020/05/26/github-codespaces-machine-learning.html" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">GitHub Codespaces for Machine Learning</h4> <div class="elp-description">With Codespaces, contributors can spin up a ready-to-go GitHub project-specific dev environment in the cloud. In this post, I’ll show how to give potential contributors a graceful start by configuring Codespaces for an ML project.</div> <div class="elp-link">dlite.cc</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-06-08/derek_haynes-c4995bfd3020af81632b8d434220c631.jpg" alt="GitHub Codespaces for Machine Learning"> </div> </a> </section> <p></p> <p>DVC Ambassador Marcel gave a tutorial about DVC to a bioinformatics student group, and then an even bigger talk at the Federal University of Rio Grande do Norte. His talk focused on how to use DVC in the context of scientific reproducibility- specifically, large biological datasets, which are often transformed and processed several times before ML models are fit. 
In my experience, Git-flow is severely underutilized in life sciences research, so it's exciting to see Marcel's ideas getting a big audience.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="pt" dir="ltr">Interessados(as) na área de Ciência de Dados? Na próxima sexta-feira as 14h teremos uma palestra sobre uma das novíssimas ferramentas da área, o DVC - Data Version Control!!! Não percam essa oportunidade. <a href="https://twitter.com/ufrnbr">@ufrnbr</a> <a href="https://twitter.com/PropesqUFRN">@PropesqUFRN</a> <a href="https://t.co/AmXxz7ioVG">pic.twitter.com/AmXxz7ioVG</a></p>— ppgeecufrn (@ppgeecufrn) <a href="https://twitter.com/ppgeecufrn/status/1263260554443005954">May 21, 2020</a></blockquote> <p>Also, Marcel is the first author of a new scientific paper about mobility data across 131 countries during the COVID-19 pandemic. The preprocessing pipeline is versioned with DVC. We don't know how Marcel gets this much done.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.sciencedirect.com/science/article/pii/S2352340920305928" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Dataset for country profile and mobility analysis in the assessment of COVID-19 pandemic</h4> <div class="elp-description">M. Ribeiro-Dantas, G. Alves, R.B. Gomes, L.C.T. Bezerra, L. Lima and I. Silva</div> <div class="elp-link">sciencedirect.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-06-08/data_in_brief_logo-2b4c63ad516deb095fbb38327c04e53d.jpeg" alt="Dataset for country profile and mobility analysis in the assessment of COVID-19 pandemic"> </div> </a> </section> <p></p> <p>Also just released is a scientific paper by Christoph Jansen et al. about a framework for computational reproducibility in the life sciences that integrates DVC. 
The framework is called <a href="https://github.com/curious-containers/curious-containers" target="_blank" rel="nofollow noopener noreferrer">Curious Containers</a>- definitely worth checking out for biomedical researchers interested in deep learning.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.sciencedirect.com/science/article/abs/pii/S0167739X19318096" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Curious Containers: A framework for computational reproducibility in life sciences with support for Deep Learning applications</h4> <div class="elp-description">C. Jansen, J. Annuscheit, B. Schilling, K. Strohmenger, M. Whitt, F. Bartusch, C. Herta, P. Hufnagl, and D. Krefting</div> <div class="elp-link">sciencedirect.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-06-08/fgcs_cover-77a45f0e89b711a9797ac2137c86e70b.jpg" alt="Curious Containers: A framework for computational reproducibility in life sciences with support for Deep Learning applications"> </div> </a> </section> <p></p> <p>In other work of vital interest to the good of humanity, this month has seen some awesome applications of the <a href="https://dvc.org/blog/a-public-reddit-dataset" target="_blank" rel="nofollow noopener noreferrer">public Reddit dataset we released in February</a>. The dataset is designed for an NLP task of mighty importance: will Redditors vote that the poster is an asshole, or not?</p> <p>Daniele Gentile beat our benchmark classifier (62% accuracy, but not bad for logistic regression!) with Doc2Vec embeddings and a 500-neuron network. He got 71% accuracy on held-out data- nice! 
His blog is a fun read, and code's included if you want to follow along.</p> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/@danielegentili/artificial-intelligence-confirms-you-are-an-a-hole-e8eef354dc2" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Artificial Intelligence confirms you are an a**hole</h4> <div class="elp-description">Q-LO is a small artificial brain that can determine if you are the a**hole or not in a situation from its description.</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-06-08/medium_logo-45140ce1eb5fe8d0caed749229873cca.png" alt="Artificial Intelligence confirms you are an a**hole"> </div> </a> </section> <p></p> <p>Elsewhere on the internet, data scientist Dan Cassin delivered this beautiful tweet:</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Used a dataset from <a href="https://t.co/6yDX1A9Rga">https://t.co/6yDX1A9Rga</a> on <a href="https://twitter.com/Reddit">@reddit</a>'s AITA, used NLTK for processing, TFIDF, then UMAP, and the result is the coolest, but most unhelpful graph I've ever made. <a href="https://twitter.com/matplotlib">@matplotlib</a> <a href="https://t.co/fYpuvwTIYE">pic.twitter.com/fYpuvwTIYE</a></p>— Dan Cassin (@Dan_Cassin) <a href="https://twitter.com/Dan_Cassin/status/1256999648901787648">May 3, 2020</a></blockquote> <p>Last, I want to point you to two other excellent blogs. 
<a href="https://github.com/curiousily" target="_blank" rel="nofollow noopener noreferrer">Venelin Valkov</a> released a blog, <a href="https://www.curiousily.com/posts/reproducible-machine-learning-and-experiment-tracking-pipiline-with-python-and-dvc/" target="_blank" rel="nofollow noopener noreferrer">Reproducible machine learning and experiment tracking pipeline with Python and DVC</a>, that contains not only a detailed sample project but a livecoding video!</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/6_kK6wRtzhk?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p><a href="https://www.linkedin.com/in/matthewmcateer0/" target="_blank" rel="nofollow noopener noreferrer">Matthew McAteer</a> revisited the famous 2015 <a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf" target="_blank" rel="nofollow noopener noreferrer">Hidden Technical Debt in Machine Learning Systems</a> paper to ask which recommendations still work five years later. 
It's pretty great- <a href="https://matthewmcateer.me/blog/machine-learning-technical-debt/" target="_blank" rel="nofollow noopener noreferrer">please read it</a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 279.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/53bf1aaa1578f0c5f73792de45c56435/4e10f/spongebob.png" alt="spongebob" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Meme by Matthew McAteer. Click to enlarge.</em></p> <h2 id="coming-up-soon" style="position:relative;">Coming up soon<a href="#coming-up-soon" aria-label="coming up soon permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are a couple of events to look forward to in the next few weeks. I'll be speaking at two conferences: first, at <a href="https://mlopsworld.com/program/" target="_blank" rel="nofollow noopener noreferrer">MLOps World</a>, about CI/CD and ML. Next, I'm <a href="https://computationalaudiology.com/the-critical-role-of-computing-infrastructure-in-computational-audiology/" target="_blank" rel="nofollow noopener noreferrer">organizing a workshop</a> at the Virtual Conference on Computational Audiology. 
To get ready, I'm gathering resources about good computing practices for scientists and biomedical research labs- <a href="https://github.com/andronovhopf/Lab_Computing_Resources" target="_blank" rel="nofollow noopener noreferrer">contributions are welcome</a>.</p> <p>Another talk on our radar is at EuroPython 2020. Engineer <a href="https://ep2020.europython.eu/talks/CXG7TcM-automating-machine-learning-workflow-with-dvc/" target="_blank" rel="nofollow noopener noreferrer">Hongjoo Lee will be talking about building a CI/CD workflow for ML with DVC</a>- we're very interested to learn about their approach.</p> <p>Lastly, <a href="http://ml-repa.ru/" target="_blank" rel="nofollow noopener noreferrer">ML REPA</a> leader and new DVC Ambassador <a href="https://twitter.com/mnrozhkov" target="_blank" rel="nofollow noopener noreferrer">Mikhail Rozhkov</a> is working on a Udemy course about DVC. Look for more updates this summer!</p> <p>Thanks for reading this month. As always, we're proud of the ways our community works for better, more rigorous ML.</p>https://dvc.org/blog/may-20-community-gemshttps://dvc.org/blog/may-20-community-gemsTue, 26 May 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here are some Q&A's from our Discord channel that we think are worth sharing.</p> <h3 id="q-how-do-i-completely-delete-a-file-from-dvc" style="position:relative;">Q: <a 
href="https://discord.com/channels/485586884165107732/563406153334128681/710546561498873886" target="_blank" rel="nofollow noopener noreferrer">How do I completely delete a file from DVC?</a><a href="#q-how-do-i-completely-delete-a-file-from-dvc" aria-label="q how do i completely delete a file from dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To stop tracking a file with DVC, you can simply delete the file and its corresponding <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file (if there is one) from your project. But, what if you want to entirely erase a file from DVC?</p> <p>After deleting the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file, you'll usually want to <a href="https://dvc.org/doc/command-reference/gc#gc" target="_blank" rel="nofollow noopener noreferrer">clear your DVC cache</a>. Ordinarily, that's done with <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>. 
However, if there's any chance the file you wish to remove might be referenced by another commit (even under a different name), be sure to use the right flag: <a href="https://dvc.org/doc/command-reference/gc#--all-commits"><code>dvc gc --all-commits</code></a>.</p> <p>If you want to remove a single <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file without doing a cache cleanup, look into the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file and note the <code>md5</code> field inside. Then use this value to identify the corresponding file in your <code>.dvc/cache</code> and delete it. For example: if your target file has <code>md5</code>: 123456, the corresponding file in your cache will be <code>.dvc/cache/12/3456</code>.</p> <p>There's one last case worth mentioning: what if you're deleting a file inside a DVC-tracked folder? For example, say you've previously run</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">dvc add data_dir</code></pre></div> <p>and now want to remove a single file (say, <code>image_1.png</code>) from <code>data_dir</code>. When DVC starts tracking a directory, it creates a corresponding <code>.dir</code> file inside <code>.dvc/cache</code> that lists every file and subfolder, as well as an <code>md5</code> for each, in a JSON format. You'll want to locate this <code>.dir</code> file in the cache, and then find the entry corresponding to <code>image_1.png</code>. It'll give the <code>md5</code> for <code>image_1.png</code>. Finally, go back to <code>.dvc/cache</code>, identify the file corresponding to that <code>md5</code>, and delete it. 
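</p> <p>The cache lookup described above can be sketched in a few lines of Python. This is only an illustration, assuming the cache layout described in this post (the first two characters of the <code>md5</code> name the folder) and assuming each entry in a <code>.dir</code> file carries <code>md5</code> and <code>relpath</code> keys. The helper names <code>cache_path</code> and <code>find_in_dir_listing</code> and the sample listing are hypothetical, not part of DVC.</p>

```python
import json
from pathlib import Path


def cache_path(md5, cache_dir=".dvc/cache"):
    """Map an md5 to its cache file: the first two characters name the folder."""
    return Path(cache_dir) / md5[:2] / md5[2:]


def find_in_dir_listing(dir_json, relpath):
    """Return the md5 recorded for relpath in the JSON text of a .dir file."""
    # Assumption: each entry is an object with "md5" and "relpath" keys.
    for entry in json.loads(dir_json):
        if entry["relpath"] == relpath:
            return entry["md5"]
    raise KeyError(relpath)


# Hypothetical .dir contents, shortened to match the example md5 in the text.
listing = '[{"md5": "123456", "relpath": "image_1.png"}]'
md5 = find_in_dir_listing(listing, "image_1.png")
print(cache_path(md5).as_posix())  # prints .dvc/cache/12/3456
```

<p>The sketch only computes paths and never deletes anything, so you can check the result by hand before removing a file from <code>.dvc/cache</code>.</p> <p>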
For details about <code>.dir</code> files, where to find them, and how they're used, <a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">see our docs about the structure of the cache</a>.</p> <p>Having said all this… please know that in the future, we plan to support a function like <code>git rm</code> that will allow easier deletes from DVC!</p> <h3 id="q-is-it-safe-to-add-a-custom-file-to-my-dvc-remote" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/707551737745244230" target="_blank" rel="nofollow noopener noreferrer">Is it safe to add a custom file to my DVC remote?</a><a href="#q-is-it-safe-to-add-a-custom-file-to-my-dvc-remote" aria-label="q is it safe to add a custom file to my dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Definitely. Some people add additional files to their DVC remote, like a README to explain to teammates what the folder is being used for. Having an additional file in the remote that isn't part of DVC tracking won't pose any issues. 
You would only encounter problems if you were manually modifying or deleting contents of the remote managed by DVC.</p> <h3 id="q-are-there-limits-to-how-many-files-dvc-can-handle-my-dataset-contains-100000-files" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/706538115048669274" target="_blank" rel="nofollow noopener noreferrer">Are there limits to how many files DVC can handle? My dataset contains ~100,000 files.</a><a href="#q-are-there-limits-to-how-many-files-dvc-can-handle-my-dataset-contains-100000-files" aria-label="q are there limits to how many files dvc can handle my dataset contains 100000 files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We ourselves have stored datasets containing up to 2 million files, so 100,000 is certainly feasible. Of course, the larger your dataset, the more time data transfer operations will take. 
Luckily, <a href="https://dvc.org/blog/dvc-3-years-and-1-0-release#data-transfer-optimizations" target="_blank" rel="nofollow noopener noreferrer">DVC 1.0 contains several data transfer optimizations</a> that substantially reduce the time needed to run <a href="https://dvc.org/doc/command-reference/pull#-c"><code>dvc pull / push / status -c / gc -c</code></a> on very large datasets.</p> <h3 id="q-two-developers-on-my-team-are-doing-dvc-push-to-the-same-remote-should-they-dvc-pull-first" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/704211629075857468" target="_blank" rel="nofollow noopener noreferrer">Two developers on my team are doing <code>dvc push</code> to the same remote. Should they <code>dvc pull</code> first?</a><a href="#q-two-developers-on-my-team-are-doing-dvc-push-to-the-same-remote-should-they-dvc-pull-first" aria-label="q two developers on my team are doing dvc push to the same remote should they dvc pull first permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>It's safe to push simultaneously; no <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> is needed. While some teams might be in the habit of pulling frequently, as in Git flow, there is less risk of "merge conflicts" in DVC. 
That's because DVC remotes store files indexed by <code>md5</code>s, so the probability of a collision is usually very low: if two developers have two different versions of a file, they'll be stored as two separate files in the DVC remote, so there are no merge conflicts.</p> <h3 id="q-what-are-tmp-files-in-my-dvc-remote" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/563406153334128681/698163554095857745" target="_blank" rel="nofollow noopener noreferrer">What are <code>*.tmp</code> files in my DVC remote?</a><a href="#q-what-are-tmp-files-in-my-dvc-remote" aria-label="q what are tmp files in my dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Inside your DVC remote, you might see <code>.tmp</code> files from incomplete uploads. This can happen if a user killed a process like <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>. You can safely remove them; for example, if you're using an S3 bucket, <code>aws s3 rm ... *.tmp</code> will do the trick.</p> <p>One caveat: before you delete, make sure no one is actively running <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a>.</p> <h3 id="q-im-using-a-google-cloud-platform-gcp-bucket-as-a-dvc-remote-and-getting-an-error-any-ideas" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/485596304961962003/705131622537756702" target="_blank" rel="nofollow noopener noreferrer">I'm using a Google Cloud Platform (GCP) bucket as a DVC remote and getting an error. 
Any ideas?</a><a href="#q-im-using-a-google-cloud-platform-gcp-bucket-as-a-dvc-remote-and-getting-an-error-any-ideas" aria-label="q im using a google cloud platform gcp bucket as a dvc remote and getting an error any ideas permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you're getting the error,</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">ERROR: unexpected error - ('invalid_grant: Bad Request', '{\n "error": "invalid_grant",\n "error_description": "Bad Request"\n}')</code></pre></div> <p>something is going wrong with your GCP authentication! A few things to check: first, <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">check out our docs</a> to <a href="https://dvc.org/doc/command-reference/remote/add"><code>dvc remote add</code></a> a Google Cloud bucket as your remote. Note that before DVC can use this type of remote, you have to configure your credentials through the GCP CLI (<a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">see docs here</a>).</p> <p>If you're still getting an error, DVC probably can't find the <code>.json</code> credentials file for your GCP bucket. Try authenticating using <code>gcloud beta auth application-default login</code>. 
This command obtains your access credentials and places them in a <code>.json</code> in your local workspace.</p> <h3 id="q-im-working-on-several-projects-that-all-need-involve-the-same-saved-model-one-project-trains-a-model-and-pushes-it-to-cloud-storage-with-dvc-push-and-another-takes-the-model-out-of-cloud-storage-for-use-whats-the-best-practice-for-doing-this-with-dvc" style="position:relative;">Q: <a href="https://discord.com/channels/485586884165107732/485596304961962003/708318821253120040" target="_blank" rel="nofollow noopener noreferrer">I'm working on several projects that all need involve the same saved model. One project trains a model and pushes it to cloud storage with <code>dvc push</code>, and another takes the model out of cloud storage for use. What's the best practice for doing this with DVC?</a><a href="#q-im-working-on-several-projects-that-all-need-involve-the-same-saved-model-one-project-trains-a-model-and-pushes-it-to-cloud-storage-with-dvc-push-and-another-takes-the-model-out-of-cloud-storage-for-use-whats-the-best-practice-for-doing-this-with-dvc" aria-label="q im working on several projects that all need involve the same saved model one project trains a model and pushes it to cloud storage with dvc push and another takes the model out of cloud storage for use whats the best practice for doing this with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>One of DVC's goals is to make it easy to move models and datasets in and out of cloud storage. 
We had this in mind when we designed <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>: it lets you reuse artifacts from one project in another. And you can quickly synchronize an artifact, like a model or dataset, with its latest version using <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a>. Check out our <a href="https://dvc.org/doc/command-reference/import" target="_blank" rel="nofollow noopener noreferrer">docs about <code>import</code></a>, and also our <a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">data registry use case</a> for an example of sharing artifacts across projects.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 690.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/da720a4b7b9b33a811b2b4fb6b176e86/39600/data-registry.png" alt="data registry" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Using DVC for sharing artifacts like datasets and models across projects and teammates.</em></p>https://dvc.org/blog/may-20-dvc-heartbeathttps://dvc.org/blog/may-20-dvc-heartbeatThu, 14 May 2020 00:00:00 GMT<p>Welcome to the May Heartbeat, our <a href="#news">monthly roundup of cool happenings</a>, <a href="#new-releases">new releases</a>, <a href="#from-the-community">good reads</a> and other noteworthy developments in the DVC community.</p> <h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 
1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><strong>DVC turns 3.</strong> On May 4th, we celebrated DVC's third birthday! Fearless leader Dmitry Petrov <a href="https://dvc.org/blog/dvc-3-years-and-1-0-release" target="_blank" rel="nofollow noopener noreferrer">wrote a retrospective</a> about how the team has grown and what we've learned from our users, contributors, and colleagues. Thanks to everyone who celebrated with us!</p> <p><strong>Ambassador program launched.</strong> DVC has just kicked off our ambassador program with the help of our first ambassador, <a href="https://twitter.com/messages/40813700-894970070358564864" target="_blank" rel="nofollow noopener noreferrer">Marcel Ribeiro-Dantas</a>. Marcel is an early-stage researcher at the Institut Curie, a veteran <a href="https://fedoraproject.org/wiki/User:Mribeirodantas" target="_blank" rel="nofollow noopener noreferrer">ambassador of the Fedora Project</a>, and a <a href="http://mribeirodantas.me/" target="_blank" rel="nofollow noopener noreferrer">data science blogger</a>. Becoming an ambassador is a way for folks who are passionate about contributing to the DVC community to get recognized for their efforts. It's also a way for us to help volunteers with financial support for meetups and travel, as well as chances to work more closely with our team. The program is ideal for anyone who already likes blogging about DVC, contributing code, and hosting get-togethers (virtual or otherwise), but especially advanced students and early career data scientists and engineers! 
<a href="https://dvc.org/blog/dvc-ambassador-program-announcement" target="_blank" rel="nofollow noopener noreferrer">Learn more about it here.</a></p> <p><strong>DVC is part of 2020 Google Season of Docs.</strong> Another way to get involved with DVC is through <a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a>, a program we're participating in for the second year in a row. This program is for technical writers to get paid experience working with the DVC team in fall 2020. Right now, we're accepting proposals from interested writers. <a href="https://dvc.org/blog/gsod-ideas-2020" target="_blank" rel="nofollow noopener noreferrer">Find out more here.</a></p> <p><strong>5000 GitHub Stars.</strong> It finally happened- we passed 5,000 stars <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">on our GitHub repo!</a></p> <p><img src="https://media.giphy.com/media/igWE67cPgTrWwXq4Nz/giphy.gif" alt="Animated GIF"></p> <h2 id="new-releases" style="position:relative;">New releases<a href="#new-releases" aria-label="new releases permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Coincident with DVC's 3rd birthday, we shared a pre-release of DVC 1.0. The release is expected in a few weeks, but you can experiment with 1.0 now (and make <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">tickets in our project repo</a> if you get a bug 🐛). 
Some major new features include:</p> <ul> <li> <p><strong>Run cache</strong>, a cache of pipelines you've reproduced in your local workspace. If you re-run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> on a pipeline version that's already been executed, the run cache will save you compute time by returning the cached result.</p> </li> <li> <p><strong>Multi-stage DVC files</strong>. Users reported that their DVC pipelines changed a lot, so we've made pipeline <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files more human-readable and editable for fast redesigns.</p> </li> <li> <p><strong>Plots</strong>. We've got plots powered by <a href="https://vega.github.io/vega-lite/" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite</a> for making beautiful visualizations comparing model performance across commits! Developer Paweł Redzyński is hard at work:</p> </li> </ul> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Visual aids come to DVC 1.0, with my little help. 
<a href="https://t.co/Fd1qVr7rHb">pic.twitter.com/Fd1qVr7rHb</a></p>&mdash; Pablito (@Paffciu1) <a href="https://twitter.com/Paffciu1/status/1260119918525194241">May 12, 2020</a></blockquote> <p>You can read more about the big updates coming in DVC 1.0 <a href="https://dvc.org/blog/dvc-3-years-and-1-0-release#dvc-10-is-the-result-of-3-years-of-learning" target="_blank" rel="nofollow noopener noreferrer">in our birthday blog</a>.</p> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Developers weren't the only ones hustling this month…</p> <p><strong>First ever virtual DVC Meetup.</strong> Marcel, our new ambassador, led an initiative to <a href="https://tulu.la/events/dvc-virtual-meetup-2020-00032c" target="_blank" rel="nofollow noopener noreferrer">organize a virtual meetup</a>! Marcel shared his latest scientific work about creating a <a href="https://www.sciencedirect.com/science/article/pii/S2352340920305928?via%3Dihub" target="_blank" rel="nofollow noopener noreferrer">new comprehensive dataset about mobility</a> during the COVID-19 pandemic and then passed the mic to our two guest speakers. 
Data scientist <a href="https://github.com/ehutt" target="_blank" rel="nofollow noopener noreferrer">Elizabeth Hutton</a> spoke about how she was building a workflow for her NLP team with DVC, and <a href="https://dagshub.com/" target="_blank" rel="nofollow noopener noreferrer">DAGsHub</a> co-founder <a href="https://twitter.com/DeanPlbn" target="_blank" rel="nofollow noopener noreferrer">Dean Pleban</a> shared his custom remote file system setup for modeling Reddit post popularity. It was quite well attended for our first ever virtual hangout: we logged 40 individual logins to the meetup, with more than 30 people staying the whole time! A video of the meetup is <a href="https://tulu.la/events/dvc-virtual-meetup-2020-00032c" target="_blank" rel="nofollow noopener noreferrer">on the event page</a>, so you can still check out the talks and discussion we enjoyed.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">It was awesome speaking at the <a href="https://twitter.com/DVCorg">@DVCorg</a> meetup about <a href="https://twitter.com/Reddit">@reddit</a> post popularity prediction and DVC <a href="https://twitter.com/hashtag/remote?src=hash&ref_src=twsrc%5Etfw">#remote</a> working file systems. Also a lot of <a href="https://twitter.com/hashtag/DAGs?src=hash&ref_src=twsrc%5Etfw">#DAGs</a>. <a href="https://t.co/5WKTlIEvHK">pic.twitter.com/5WKTlIEvHK</a></p>&mdash; Dean 🐶 (@DeanPlbn) <a href="https://twitter.com/DeanPlbn/status/1258475031530790916">May 7, 2020</a></blockquote> <p><strong>Some blogs we like.</strong> As usual, there's a lot of share-worthy writing in the data science and MLOps space:</p> <ul> <li><a href="https://twitter.com/ixek" target="_blank" rel="nofollow noopener noreferrer">Tania Allard</a> wrote an intensely readable, extremely sharp guide to practical steps anyone can take to improve the reproducibility of their ML projects. 
She really nails the complexity of the workflow and the importance of decoupling code and data (which we obviously agree with very much 😏). The graphics are also 💯; Tania is a developer advocate to follow.</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://dev.to/azure/10-top-tips-for-reproducible-machine-learning-36g0" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">10 top tips for reproducible Machine Learning</h4> <div class="elp-description">The one where you get some advice to make your workflows more reproducible</div> <div class="elp-link">dev.to</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-05-14/dev_logo-4362b64c557ebe87d5d8d21ad965ffaf.png" alt="10 top tips for reproducible Machine Learning"> </div> </a> </section> <p></p> <ul> <li><a href="https://medium.com/@vimarshk" target="_blank" rel="nofollow noopener noreferrer">Vimarsh Karbhari</a> blogged about how teams that work with data can strategize better about versioning their data and analysis pipelines. Rather than giving very practical recommendations, Vimarsh stresses a deliberate and careful approach. He emphasizes how the team's choices should depend on factors like project maturity and how much flexibility is going to be needed. 
It's a solid overview of how to begin thinking about MLOps at a high level.</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/acing-ai/ml-ops-data-science-version-control-5935c49d1b76" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">ML Ops: Data Science Version Control</h4> <div class="elp-description">Data versioning primer for model, data and code.</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-05-14/acing_ai-3efe392da9b0c56e9f6f3bbae8e08580.png" alt="ML Ops: Data Science Version Control"> </div> </a> </section> <p></p> <ul> <li>Over at <a href="https://www.autoregressed.com/" target="_blank" rel="nofollow noopener noreferrer">AutoRegressed</a>, Jack Pitts shared a thorough tutorial about using <a href="https://pypi.org/project/pipenv/" target="_blank" rel="nofollow noopener noreferrer">Pipenv</a>, DVC and Git together. Together, the trio manages dependencies and versions the working environment, source code, datasets <em>and</em> trained models. It's not only a cool use case, but a very clear step-by-step explanation that should be easy to try at home. Stay till the end for a neat trick about deploying a model as a web service with Pipenv and DVC.</li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://www.autoregressed.com/blog/pipenv-and-dvc-reproducibility-in-data-science" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Pipenv and DVC: Reproducibility in Data Science</h4> <div class="elp-description">Without standards and tools to easily reproduce models, Data Science teams can become bogged down in technical debt that will make it difficult to deploy and iterate on models. 
</div> <div class="elp-link">autoregressed.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-05-14/ar_logo-5df1c108110bd9379c9bfe078c26fb46.jpg" alt="Pipenv and DVC: Reproducibility in Data Science"> </div> </a> </section> <p></p> <h2 id="nice-tweets" style="position:relative;">Nice tweets<a href="#nice-tweets" aria-label="nice tweets permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Last, here are some of our favorite tweets to read this past month:</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Data version control from <a href="https://twitter.com/DVCorg">@DVCorg</a> is one of the best new tools I've used in a while. Moving data via the cloud is just a push or pull command away. <br><br>Recommend for anyone who works on multiple machines or shares data with collaborators</p>— Liam Brannigan (@braaannigan) <a href="https://twitter.com/braaannigan/status/1257918525345234949">May 6, 2020</a></blockquote>  <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Getting around to learning <a href="https://twitter.com/DVCorg">@DVCorg</a>, and loving it so far. 
Versioning data with git-style semantics gives you a lot of functionality with surprisingly little cognitive overhead.</p>&mdash; Tim Garvin (@tcgarvin) <a href="https://twitter.com/tcgarvin/status/1258855168436813826">May 8, 2020</a></blockquote> <p><em>Thank you, thank you very much.</em></p> <p><img src="https://media.giphy.com/media/gJ2sDSKAQHUCIYUFhx/giphy.gif" alt="Thank You Very Much GIF by The Wiggles"></p> <p>As always, we want to hear what you're making with DVC and what you're reading. Tell us in the blog comments, and get in touch on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>. Happy coding!</p>https://dvc.org/blog/dvc-ambassador-program-announcementhttps://dvc.org/blog/dvc-ambassador-program-announcementFri, 08 May 2020 00:00:00 GMT<p>DVC's software can be everywhere, but its developers can’t. That’s why ambassadors, folks who do outreach and community building around projects they love, are a key part of the open source community. DVC is starting an ambassador program to help people who are passionate about our mission get involved.</p> <p>As the first DVC ambassador, and a <a href="https://fedoraproject.org/wiki/User:Mribeirodantas" target="_blank" rel="nofollow noopener noreferrer">Fedora ambassador</a> before that, I can tell you a bit about the role. As a representative of open source projects, I've participated in lots of events, made friends, and traveled. Every single time I’ve contributed, I got this nice feeling that it was all worth it. I believe that if you agree with the core values of the project, a great relationship lies ahead :).</p> <p>So what are the core values of DVC, exactly? DVC is founded on the principle of engineering solutions for making data science and machine learning rigorous and reproducible. 
If this matters to you, too, you might be a good fit for our ambassador program!</p> <p>As an ambassador, you’ll act as a bridge between DVC and your community. There are lots of ways to do this, big and small. For example:</p> <ul> <li>Write a blog post talking about how you use DVC in your projects.</li> <li>What about creating a network of DVC users and data scientists in your town? Even though we’re self-isolating now, you can still organize online meetups. <a href="https://tulu.la/events/dvc-virtual-meetup-2020-00032c" target="_blank" rel="nofollow noopener noreferrer">We already did one!</a> We help cover costs to organize meetups.</li> <li>Do you want to talk about DVC at your office, or at a conference? We help speakers develop talks, and we have some discretionary funds for travel on a case-by-case basis.</li> <li>Want to develop a feature for DVC? We welcome contributions to the code base, even if it’s your first pull request ever.</li> </ul> <p>Being an ambassador means getting closer to the team in charge of DVC, but at the same time, it means going farther to reach people outside the organization, including people who don’t know about DVC yet, people who need some help getting started, and people who are already excited about our mission and want to find meaningful ways to pitch in.</p> <h2 id="about-iterative-and-dvc" style="position:relative;">About Iterative and DVC<a href="#about-iterative-and-dvc" aria-label="about iterative and dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>DVC got started in 2017 as a personal project by Dmitry 
Petrov ( <a href="https://dvc.org/blog/dvc-3-years-and-1-0-release" target="_blank" rel="nofollow noopener noreferrer">we just celebrated our 3rd birthday</a>). Previously, Dmitry worked at Microsoft as a data scientist and did a PhD in Computer Science. In 2018, Dmitry teamed up with his co-founder Ivan Shcheklein (co-founder of <a href="https://tweetedtimes.com/" target="_blank" rel="nofollow noopener noreferrer">The Tweeted Times</a> and <a href="https://www.sedna.org/" target="_blank" rel="nofollow noopener noreferrer">Sedna</a> contributor) to incorporate Iterative.ai and grow the project. Iterative.ai is building enterprise tools for collaboration on ML projects. Currently, Iterative.ai's open source flagship project is Data Version Control (DVC), an open source version control system for managing complex workflows, datasets, and models.</p> <p>Development is ongoing in the core DVC project as well as new ventures into <a href="https://dvc.org/blog/reimagining-devops-video" target="_blank" rel="nofollow noopener noreferrer">MLOps and Continuous Integration & Delivery (CI/CD)</a> for data science. The team is small-and-mighty, with developers, engineers, and data scientists on four continents. The open source community is a huge part of all Iterative.ai projects; currently, DVC has more than <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">5,000 stars on GitHub</a> and more than 100 individual contributors!</p> <p>One of DVC’s main principles is adapting existing software engineering practices to machine learning. For example, DVC is built around Git version control: in an ML project using DVC, each experiment corresponds to a Git commit. When you check out any commit, you’ll see the source code as it was when you made the commit- as expected. 
But, you’ll also see your datasets as they were and the exact pipeline of commands you ran in that experiment!</p> <h2 id="why-become-an-ambassador" style="position:relative;">Why become an ambassador?<a href="#why-become-an-ambassador" aria-label="why become an ambassador permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Like any volunteer position, the main benefit is getting to be involved in a project you believe in. But there are some perks:</p> <ul> <li>Establishing a formal relationship with DVC that can go on your CV/resume. We'll boost your content on our social channels, too.</li> <li>Access to support from the DVC team, such as financial resources to organize your own meetup for local data scientists and ML enthusiasts</li> <li>Mentorship about crafting blogs and talks, if desired. DVC team members regularly help people in the community develop their presentations and blogs for accuracy and clarity</li> <li>Closer relationships with the DVC team means more chances to participate in conversations that guide our product decisions.</li> </ul> <p>For students and early career professionals, you can learn a lot by interacting with us! 
While you can certainly write a blog post or organize a meetup without being an ambassador, the program is a way to fast-track your learning: you'll have the creators of DVC helping you understand it well, and helping you discover features and best practices you might not have known about.</p> <p>If you're already active in the open source or MLOps community, then becoming an ambassador is a solid way to cement your relationship with DVC. We'd love to recognize you for the amazing stuff you already do.</p> <h2 id="how-to-become-an-ambassador" style="position:relative;">How to become an ambassador<a href="#how-to-become-an-ambassador" aria-label="how to become an ambassador permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you’re interested in becoming an ambassador, send us an email at <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">[email protected]</a> with the subject line “I want to be an ambassador!” Please tell us:</p> <ul> <li>A little about yourself and your professional background</li> <li>Any outreach work you’ve done before</li> <li>What kind of ambassador activities you’d be most interested in participating in</li> </ul> <p>The program is structured to provide a lot of flexibility, so each ambassador can do outreach in ways that are personally motivating and enjoyable. There are a few guidelines:</p> <ul> <li>We ask for at least a one-year commitment</li> <li>We ask ambassadors to contribute at least four activities per year, about once every three months. 
There's no upper limit to how much you can do!</li> <li>For your first contribution, we ask for a blog post- this way, we can collaborate with you to help get all the technical details right. After that, it’s up to you!</li> </ul> <h2 id="some-ideas-to-get-started" style="position:relative;">Some ideas to get started<a href="#some-ideas-to-get-started" aria-label="some ideas to get started permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our official ambassador program is just starting, but our community already has a lot of folks making noise. Here are just a few contributions we admire- we think they’re pretty cool inspirations for future projects.</p> <h3 id="blogs-and-tutorials" style="position:relative;">Blogs and tutorials<a href="#blogs-and-tutorials" aria-label="blogs and tutorials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Shareable blogs are one of our most effective outreach strategies. 
They give visibility to the author <em>and</em> new ways to use DVC, so it's a win-win.</p> <ul> <li><a href="https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/" target="_blank" rel="nofollow noopener noreferrer">Remote training with GitLab-CI and DVC</a>, by Mercel Mikl and Bert Besser (Bert has also organized a DVC meetup in Berlin)</li> <li><a href="https://towardsdatascience.com/creating-a-solid-data-science-development-environment-60df14ce3a34" target="_blank" rel="nofollow noopener noreferrer">Creating a solid Data Science development environment</a>, by Gabriel dos Santos Goncalves</li> <li><a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">Continuous Delivery for Machine Learning</a>, by Danilo Sato, Arif Wider, and Christoph Windheuser</li> <li><a href="https://mribeirodantas.xyz/blog/index.php/2020/03/05/r-dvc-and-rmarkdown/" target="_blank" rel="nofollow noopener noreferrer">Manage your Data Science Project in R</a> was my first blog post about using DVC in an R project!</li> </ul> <h3 id="talks" style="position:relative;">Talks<a href="#talks" aria-label="talks permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Community members have presented at events like PyCon, PyData, and local meetups.</p> <ul> <li><a href="https://www.slideshare.net/AlessiaMarcolini/version-control-for-data-science" target="_blank" rel="nofollow noopener noreferrer">Version control for data science</a>, by Alessia Marcolini @ PyCon DE & PyData Berlin</li> <li><a 
href="https://www.youtube.com/watch?v=rUTlqpcmiQw" target="_blank" rel="nofollow noopener noreferrer">How to easily set up and version control your machine learning pipelines</a>, by Sarah Diot-Girard & Stephanie Bracaloni @ PyData Amsterdam</li> <li><a href="https://speakerdeck.com/kurianbenoy/ml-models-and-dataset-versioning" target="_blank" rel="nofollow noopener noreferrer">ML models and dataset versioning</a>, by Kurian Benoy @ PyCon India</li> </ul> <h3 id="code-contributions" style="position:relative;">Code contributions<a href="#code-contributions" aria-label="code contributions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Our GitHub repository has lots of open discussions about potential features; it's a goldmine of ways to pitch in. 
For example:</p> <ul> <li> <p><a href="https://github.com/elgehelge" target="_blank" rel="nofollow noopener noreferrer">Helge Munk Jacobsen</a> took on an open issue in our code base about supporting hyperparameter tracking with DVC and made a pull request to add this feature.</p> </li> <li> <p><a href="https://github.com/verasativa/" target="_blank" rel="nofollow noopener noreferrer">Vera Sativa</a> added directory support to the <a href="https://dvc.org/doc/command-reference/import-url"><code>dvc import-url</code></a> function- and she was our 100th contributor, so she won her own DeeVee the owl.</p> </li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/78b685e283d679c8ebe518ea17520f6d/39600/odd_with_deevee.png" alt="odd with deevee" title="Vera and team" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Vera (center, flashing a peace sign) thanked us with this lovely picture of DeeVee and her team, <a href="https://odd.co" target="_blank" rel="nofollow noopener noreferrer">Odd Industries</a>.</em></p> <p>If any of this sounds fun to you, please be in touch over <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">email</a> (and you can also reach us on <a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and our <a href="https://discordapp.com/invite/dvwXA2N" target="_blank" rel="nofollow noopener noreferrer">Discord Channel</a>). 
We look forward to connecting with you!</p>https://dvc.org/blog/dvc-3-years-and-1-0-releasehttps://dvc.org/blog/dvc-3-years-and-1-0-releaseMon, 04 May 2020 00:00:00 GMT<h2 id="3-years-anniversary" style="position:relative;">3 years anniversary!<a href="#3-years-anniversary" aria-label="3 years anniversary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Three years ago on <strong>May 4th, 2017</strong>, I published the <a href="https://www.kdnuggets.com/2017/05/data-version-control-iterative-machine-learning.html" target="_blank" rel="nofollow noopener noreferrer">first blog post about DVC</a>. <a href="https://www.reddit.com/r/Python/comments/698ian/dvc_data_scientists_collaboration_and_iterative/" target="_blank" rel="nofollow noopener noreferrer">The first DVC discussion on Reddit</a>. Until that point, DVC was a private project between <a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">myself</a> and <a href="https://github.com/efiop" target="_blank" rel="nofollow noopener noreferrer">Ruslan</a>. Today, things look very different.</p> <p>Today, DVC gets recognized at professional conferences: people spot our logo, and sometimes even our faces, and want to chat. There's much more content about DVC coming from bloggers than from inside our organization. We're seeing more and more job postings that list DVC as a requirement, and we're showing up in <a href="https://www.amazon.com/Learn-Python-Building-Science-Applications/dp/1789535360" target="_blank" rel="nofollow noopener noreferrer">data science textbooks</a>. 
When we find a new place DVC is mentioned, we celebrate in our Slack - we've come a long way!</p> <p>The data science and ML space is fast-paced and vibrant, and we're proud that DVC is making an impact on discussions about best practices for healthy, sustainable ML. Every week, we chat with companies and research groups using DVC to make their teams more productive. We're proud to be part of the growing MLOps movement: so far, a majority of CI/CD for ML projects are implemented with DVC under the hood.</p> <p>I can confidently say that DVC wouldn't have been possible without a lot of help from our community. Thank you to everyone who has supported us:</p> <p><strong>DVC core team.</strong> The DVC team has been the force driving our project's evolution - we've grown from 2 to 12 full-time engineers, developers, and data scientists. Half of the team is focused purely on DVC, while the other half works on new DVC-related projects. We often get feedback about how fast our team answers user questions - we've been told our user support is one of DVC's "killer features". It's all thanks to this amazing team.</p> <p><strong>DVC contributors.</strong> As of today, the DVC code base has <a href="https://github.com/iterative/dvc/graphs/contributors" target="_blank" rel="nofollow noopener noreferrer">126 individual contributors</a>. Many of these folks put hours into their code contributions. 
We're grateful for their tenacity and generosity.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 156.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/5d069d099019190069a5e5789008af9f/87fcf/vera-sativa.png" alt="vera sativa" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Vera - 100th DVC contributor <a href="https://github.com/verasativa/" target="_blank" rel="nofollow noopener noreferrer">on GitHub</a>.</em></p> <p><strong>Documentation contributors.</strong> Another <a href="https://github.com/iterative/dvc.org/graphs/contributors" target="_blank" rel="nofollow noopener noreferrer">124 people contributed</a> to the <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC documentation</a> and <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">the website</a>. Every time a new person tries out DVC, they benefit from the hard work that's gone into our docs.</p> <p><strong>Active community members.</strong> Active DVC users help our team understand and better anticipate their needs and identify priorities for development. They share bright ideas for new features, locate and investigate bugs in code, and welcome and support new users.</p> <p><strong>People who give DVC a shot.</strong> Today, there are thousands of data scientists, ML engineers, and developers using DVC on a regular basis. The number of users is growing every week. Our <a href="http://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a> has almost two thousand users. Hundreds more connect with us through email and Twitter. 
To everyone willing to try out DVC, thank you for the opportunity.</p> <h2 id="dvc-10-is-the-result-of-3-years-of-learning" style="position:relative;">DVC 1.0 is the result of 3 years of learning<a href="#dvc-10-is-the-result-of-3-years-of-learning" aria-label="dvc 10 is the result of 3 years of learning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>All these contributions, big and small, have a collective impact on DVC's development. I'm happy (and a bit nervous) to announce that a pre-release of a brand new DVC 1.0 is ready for public beta testing.</p> <p>You can install the 1.0 pre-release from the master branch in our repo (instruction <a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">here</a>) or through pip:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip</span> <span class="token function">install</span> <span class="token parameter variable">--upgrade</span> <span class="token parameter variable">--pre</span> dvc</span></code></pre></div> <p>The new DVC is inspired by discussions and contributions from our community - both fresh ideas and bug reports 😅.</p> <p>Here are the most significant features we’re excited to be rolling out soon:</p> <h3 id="run-cache" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/1234" target="_blank" rel="nofollow noopener noreferrer">Run cache</a><a href="#run-cache" aria-label="run cache 
permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><em>Learnings:</em> Forcing users to make Git commits for each ML experiment creates too much overhead.</p> <p>DVC 1.0 has a "long memory" of DVC command runs. This means it can tell whether a <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> has already been run and save compute time by returning the cached result - <em>even if you didn't Git commit that past run</em>.</p> <p>We added the run-cache with CI/CD systems and other MLOps and DataOps automation tools in mind. No more auto-commits needed after <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> on the CI/CD system side.</p> <h3 id="multi-stage-dvc-files" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/1871" target="_blank" rel="nofollow noopener noreferrer">Multi-stage DVC files</a><a href="#multi-stage-dvc-files" aria-label="multi stage dvc files permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><em>Learnings:</em> ML pipelines evolve much faster than data engineering pipelines.</p> <p>We redesigned the way DVC records data processing stages 
with metafiles, to make pipelines more interpretable and editable. All pipeline stages are now saved together in a single metafile instead of in separate files.</p> <p>Data hash values are no longer stored in the pipeline metafile. This improves human readability.</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token key atrule">stages</span><span class="token punctuation">:</span>
  <span class="token key atrule">process</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> ./process_raw_data raw_data.log users.csv
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> raw_data.log
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> process_file
      <span class="token punctuation">-</span> click_threshold
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> users.csv
  <span class="token key atrule">train</span><span class="token punctuation">:</span>
    <span class="token key atrule">cmd</span><span class="token punctuation">:</span> python train.py
    <span class="token key atrule">deps</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> users.csv
    <span class="token key atrule">params</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> epochs
      <span class="token punctuation">-</span> log_file
      <span class="token punctuation">-</span> dropout
    <span class="token key atrule">metrics_no_cache</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> summary.json
    <span class="token key atrule">metrics</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> logs.csv
    <span class="token key atrule">outs</span><span class="token punctuation">:</span>
      <span class="token punctuation">-</span> model.pkl</code></pre></div> <h3 id="plots" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3409" target="_blank" rel="nofollow noopener noreferrer">Plots</a><a href="#plots" aria-label="plots permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><em>Learnings:</em> Versioning metrics and plots is no less important than data versioning.</p> <p>Countless users asked us when we'd support metrics visualizations. Now they're here: DVC 1.0 introduces metrics file visualization commands, <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff</code></a> and <a href="https://dvc.org/doc/command-reference/plots/show"><code>dvc plots show</code></a>. DVC plots are powered by the <a href="https://vega.github.io/vega-lite/" target="_blank" rel="nofollow noopener noreferrer">Vega-Lite</a> graphics library.</p> <p>This feature not only shows visualizations based on the current state of your project; it can also combine multiple plots from your Git history in a single chart so you can compare results across commits. 
Users can visualize how, for example, their model accuracy in the latest commit differs from another commit (or even multiple commits).</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">-d</span> logs.csv HEAD HEAD^ d1e4d848 baseline_march
</span>file:///Users/dmitry/src/plot/logs.csv.html
<span class="token line"><span class="token input">$ </span><span class="token command">open</span> logs.csv.html</span></code></pre></div> <p><img src="https://dvc.org/2020-05-04/dvc-plots-092248e6898ab510fc3803efb5e22d9f.svg" alt=""></p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc plots diff</span> <span class="token parameter variable">-d</span> logs.csv HEAD HEAD^ d1e4d848 baseline_march <span class="token punctuation">\</span>
    <span class="token parameter variable">-x</span> loss <span class="token parameter variable">--template</span> scatter
</span>file:///Users/dmitry/src/plot/logs.csv.html
<span class="token line"><span class="token input">$ </span><span class="token command">open</span> logs.csv.html</span></code></pre></div> <p><img src="https://dvc.org/2020-05-04/dvc-plots-scatter-9cfc6c2078273faa482129d8d1609967.svg" alt=""></p> <h3 id="data-transfer-optimizations" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3488" target="_blank" rel="nofollow noopener noreferrer">Data transfer optimizations</a><a href="#data-transfer-optimizations" aria-label="data transfer optimizations permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 
5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><em>Learnings:</em> In ML projects, data transfer optimization is still king.</p> <p>We've done substantial work on optimizing data management commands, such as <a href="https://dvc.org/doc/command-reference/pull#-c"><code>dvc pull / push / status -c / gc -c</code></a>. Now, based on the amount of data, DVC can choose an optimal strategy for traversing the data remote.</p> <p><a href="https://github.com/iterative/dvc/issues/2147" target="_blank" rel="nofollow noopener noreferrer">Mini-indexes</a> were introduced to help DVC instantly check data directories instead of iterating over millions of files. This also speeds up adding files to and removing files from large directories.</p> <p>More optimizations are included in the release based on performance bottlenecks we profiled. 
A detailed <a href="https://gist.github.com/pmrowla/338d9645bd05df966f8aba8366cab308" target="_blank" rel="nofollow noopener noreferrer">benchmark report</a> shows how many seconds it takes to run specific commands on a directory of 2M images.</p> <p><img src="https://dvc.org/2020-05-04/benchmarks-fb3909a1a199bbfdfb5b66b689e2ffb0.svg" alt=""></p> <h3 id="hyperparameter-tracking" style="position:relative;"><a href="https://github.com/iterative/dvc/issues/3393" target="_blank" rel="nofollow noopener noreferrer">Hyperparameter tracking</a><a href="#hyperparameter-tracking" aria-label="hyperparameter tracking permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><em>Learnings:</em> ML pipeline steps depend only on a subset of the config file.</p> <p>This feature was actually released in the last DVC version, 0.93 (see <a href="https://dvc.org/doc/command-reference/params" target="_blank" rel="nofollow noopener noreferrer">params docs</a>). 
However, it is an important step toward supporting configuration files and ML experiments in a more holistic way.</p> <h3 id="for-more-information-on-the-new-features" style="position:relative;">For more information on the new features…<a href="#for-more-information-on-the-new-features" aria-label="for more information on the new features permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Each of the big new features and improvements deserves a separate blog post. We will be posting more - please stay in touch.</p> <p>I hope our most active users will find time to try the DVC pre-release version and give us their feedback. The installation instructions are <a href="https://dvc.org/doc/install/pre-release" target="_blank" rel="nofollow noopener noreferrer">on our website</a>.</p> <h2 id="5000-github-stars" style="position:relative;">5000 GitHub stars<a href="#5000-github-stars" aria-label="5000 github stars permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Activity on our GitHub page has grown organically since the DVC repo went public on May 4th, 2017. 
Coincidentally, today, on our 3rd anniversary, we have reached 5000 stars:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bf64a01a292055b72fcf916ef2d6d1f8/39600/5k_github.png" alt="5k github" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h2 id="thank-you" style="position:relative;">Thank you!<a href="#thank-you" aria-label="thank you permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Thank you again to all DVC contributors, community members, and users. Every piece of your help is highly appreciated and will bring huge benefits to the entire ecosystem of data and ML projects.</p> <p>Stay healthy and safe, wherever you are in the world. 
And stay in touch on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> and our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>.</p>https://dvc.org/blog/gsod-ideas-2020https://dvc.org/blog/gsod-ideas-2020Thu, 30 Apr 2020 00:00:00 GMT<p>After a successful experience with the first edition of <strong>Google Season of Docs</strong> <a href="https://dvc.org/blog/dvc-project-ideas-for-google-summer-of-docs-2019">in 2019</a>, we're putting out a call for writers to apply to work with DVC as part of the <a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">2020 edition</a>. If you want to write open source software documentation with mentorship from our team, read on.</p> <p><strong>TLDR</strong>: Skip to <a href="#project-ideas">project ideas</a>.</p> <p><a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC</a> has a dedicated docs team and a <a href="https://dvc.org/doc/user-guide/contributing/docs" target="_blank" rel="nofollow noopener noreferrer">well-defined process</a> for creating and maintaining our documentation, modeled in part on our past GSoD experience. 
We are happy to share our experience, introduce technical writers to the world of open source and machine learning best practices, and work together on improving our documentation.</p> <h2 id="previous-experience" style="position:relative;">Previous experience<a href="#previous-experience" aria-label="previous experience permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In last year's Season, we matched with prolific writer <a href="https://github.com/dashohoxha" target="_blank" rel="nofollow noopener noreferrer">Dashamir</a>, who helped us give proper structure to important parts of our docs and address key issues.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">As <a href="https://twitter.com/hashtag/GSoD2019?src=hash&ref_src=twsrc%5Etfw">#GSoD2019</a> is officially over we would like to thank <a href="https://twitter.com/dashohoxha">@dashohoxha</a> for contributing online interactive tutorials <a href="https://t.co/iZKkqmx5pm">https://t.co/iZKkqmx5pm</a> (👈 link or search for Katacoda button on <a href="https://t.co/b8MwcZdY3s">https://t.co/b8MwcZdY3s</a>) 😍 Thank you <a href="https://twitter.com/GoogleOSS">@GoogleOSS</a> team and <a href="https://twitter.com/chenopis">@chenopis</a> for organizing this 🙏! 
<a href="https://t.co/SGrgtA5J0B">pic.twitter.com/SGrgtA5J0B</a></p>— 🦉DVC (@DVCorg) <a href="https://twitter.com/DVCorg/status/1205203662827483136">December 12, 2019</a></blockquote> <p>Some of our achievements together were:</p> <ul> <li>Reorganized our <a href="https://github.com/iterative/dvc.org/pull/666" target="_blank" rel="nofollow noopener noreferrer">tutorials</a> and core <a href="https://github.com/iterative/dvc.org/pull/726" target="_blank" rel="nofollow noopener noreferrer">contribution guide</a></li> <li>Created <a href="https://github.com/iterative/dvc.org/issues/546" target="_blank" rel="nofollow noopener noreferrer">interactive lessons</a> on <a href="https://www.katacoda.com/dvc" target="_blank" rel="nofollow noopener noreferrer">Katacoda</a></li> <li>Docs <a href="https://github.com/iterative/dvc.org/pull/734" target="_blank" rel="nofollow noopener noreferrer">cleanup</a></li> <li>Suggested the creation of a <a href="https://github.com/iterative/dvc.org/issues/563" target="_blank" rel="nofollow noopener noreferrer">How To</a> section for our docs</li> <li>Other <a href="https://github.com/iterative/dvc.org/pulls?q=is%3Apr+is%3Aclosed+author%3Adashohoxha" target="_blank" rel="nofollow noopener noreferrer">contributions</a></li> </ul> <p>Another collaborator we connected with via GSoD’19 was an amazing student intern, <a href="https://github.com/algomaster99" target="_blank" rel="nofollow noopener noreferrer">Aman</a>. He helped us address <a href="https://github.com/iterative/dvc.org/pulls?q=is%3Apr+author%3Aalgomaster99+is%3Aclosed" target="_blank" rel="nofollow noopener noreferrer">dozens of tickets</a> related to our Node.js docs web app. 
For example:</p> <ul> <li> <p>Contributed to our <a href="https://github.com/iterative/dvc.org/pull/315" target="_blank" rel="nofollow noopener noreferrer">command reference</a> and <a href="https://github.com/iterative/dvc.org/pull/366" target="_blank" rel="nofollow noopener noreferrer">user guide</a>, and created a much-needed <a href="https://github.com/iterative/dvc.org/pull/317" target="_blank" rel="nofollow noopener noreferrer">documentation contribution</a> guide</p> </li> <li> <p><a href="https://github.com/iterative/dvc.org/pull/328" target="_blank" rel="nofollow noopener noreferrer">Formatted</a> the source code of our docs and established an <a href="https://github.com/iterative/dvc.org/pull/386" target="_blank" rel="nofollow noopener noreferrer">automated mechanism</a> to enforce pretty formatting going forward</p> </li> <li> <p>Implemented super useful hovering tooltips based on a special <a href="https://github.com/iterative/dvc.org/pull/431" target="_blank" rel="nofollow noopener noreferrer">glossary</a>:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 595px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/66e1324a2a8352f1b3605e3fa6b90731/39600/tooltip.png" alt="tooltip" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Tooltip in the <a href="https://dvc.org/doc/command-reference/remote"><code>dvc remote</code></a> command reference</em></p> </li> </ul> <h3 id="community-outreach" style="position:relative;">Community outreach<a href="#community-outreach" aria-label="community outreach permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 
0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>More positive results of the program included talks and meetups organized by our open source contributors, with our mentorship:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 604.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/263792dadd1fa5d01ed810e1d7a09bb8/39600/SciPy_India_Aman.png" alt="SciPy India Aman" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Our intern Aman took a several-hour long train ride to <a href="https://static.fossee.in/scipy2019/SciPyTalks/SciPyIndia2019%5FS011%5FStoring%5Fa%5Ffew%5Fversions%5Fof%5Fa%5F5GB%5Ffile%5Fin%5Fa%5Fdata%5Fscience%5Fproject%5F20191130.mp4" target="_blank" rel="nofollow noopener noreferrer">talk</a> at <a href="https://scipy.in/2019" target="_blank" rel="nofollow noopener noreferrer">SciPy India 2019</a>.</em></p> <p>Another star contributor who found our project via GSoD, <a href="https://github.com/kurianbenoy" target="_blank" rel="nofollow noopener noreferrer">Kurian</a>, closed <a href="https://github.com/iterative/dvc.org/issues?q=is%3Aissue+kurianbenoy" target="_blank" rel="nofollow noopener noreferrer">several tickets</a>, produced a DVC intro tutorial in <a href="https://www.kaggle.com/kurianbenoy/introduction-to-data-version-control-dvc" target="_blank" rel="nofollow noopener noreferrer">Kaggle</a> and <a href="https://colab.research.google.com/drive/1O1XmUZ8Roj1dFxWTrpE55_A7lVkWfG04" target="_blank" rel="nofollow noopener noreferrer">Colab</a>, and ended up giving a talk in <a href="https://in.pycon.org/cfp/2019/proposals/machine-learning-model-and-dataset-versioning~dRqRb/" target="_blank" rel="nofollow noopener noreferrer">PyCon India</a>:</p> <div class="yt-embed-wrapper"><iframe width="100%" 
height="315" src="https://www.youtube-nocookie.com/embed/Ipzf6oQqQpo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>He also covered DVC for the <a href="https://kurianbenoy.github.io/2019-11-03-Devsprints%5Fexperience/" target="_blank" rel="nofollow noopener noreferrer">Devsprints</a> of <a href="https://enotice.vtools.ieee.org/public/50448" target="_blank" rel="nofollow noopener noreferrer">MEC.conf</a></p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Thank you <a href="https://twitter.com/DVCorg">@DVCorg</a> for participating in the Devsprints, by <a href="https://twitter.com/FossMec">@FossMEC</a> and <a href="https://twitter.com/excelmec">@excelmec</a>. We had <a href="https://twitter.com/shcheklein">@shcheklein</a> who joined us all the way from SF and explained how open source is boosting the future. Srinidhi and <a href="https://twitter.com/kurianbenoy2">@kurianbenoy2</a> helped participants get started to contributing to the project.</p>— FOSS MEC (@FossMec) <a href="https://twitter.com/FossMec/status/1192866498324254720">November 8, 2019</a></blockquote> <p>Yet another outstanding contributor, <a href="https://twitter.com/explorer_07" target="_blank" rel="nofollow noopener noreferrer">Nabanita</a>, ended up organizing a DVC-themed hackathon later that year:</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Our open source event Hacktoberfest-themed meet-up was a success. 
Thanks to <a href="https://twitter.com/DVCorg">@DVCorg</a> and it's mentors for all the hard work. <br>Some of our attendees made their first PR on DVC and got them merged. Kudos to the team! <br>PS: 🍕 was the second best thing of the evening. <a href="https://t.co/zAWC0TVlPd">pic.twitter.com/zAWC0TVlPd</a></p>— Programming Society IIIT-Bh (@psociiit) <a href="https://twitter.com/psociiit/status/1185150096792535040">October 18, 2019</a></blockquote> <h2 id="prerequisites-to-apply" style="position:relative;">Prerequisites to apply<a href="#prerequisites-to-apply" aria-label="prerequisites to apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Besides the general requirements to apply to Google Season of Docs, there are a few skills we look for in applicants.</p> <ol> <li> <p><strong>Clear English writing.</strong> We strive to express the concepts, processes, and details around DVC clearly, correctly, and completely. We use general and friendly wording as much as possible and pay close attention to consistency in our terminology. Our team will help with copy editing.</p> </li> <li> <p><strong>Command line experience.</strong> <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">DVC</a> is a command line tool that builds on top of <a href="https://git-scm.com/" target="_blank" rel="nofollow noopener noreferrer">Git</a>, so being able to play with it and test the features will be very useful. 
Creating and managing files, GNU/Linux commands, file and permission administration are desired skills.</p> </li> <li> <p><strong>People skills.</strong> We put a high value on communication: the ability to discuss ideas, explain your goals, report progress, and work kindly with more or less technical teammates.</p> </li> </ol> <p>If you like our mission but aren't sure if you're sufficiently prepared, please be in touch anyway. We'd love to hear from you.</p> <h2 id="project-ideas" style="position:relative;">Project ideas<a href="#project-ideas" aria-label="project ideas permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Below are several project ideas that are an immediate priority for the DVC docs team. We welcome technical writers to create their own proposals, even if they differ from our ideas. Most projects will be mentored primarily by our lead technical writer, <a href="https://github.com/jorgeorpinel" target="_blank" rel="nofollow noopener noreferrer">Jorge</a>.</p> <ol> <li> <p><strong>"How To" section.</strong> Other than our <a href="https://dvc.org/doc/use-cases" target="_blank" rel="nofollow noopener noreferrer">use cases</a>, we still lack a good place to answer common questions in our docs (think FAQ). 
We have compiled a <a href="https://github.com/iterative/dvc.org/issues/899" target="_blank" rel="nofollow noopener noreferrer">set of topics</a> that we think would be best explained in a new <strong>How To</strong> section for this purpose.</p> <p>This project would involve relocating bits and pieces of info from existing docs into new how-tos, as well as writing significant new material to complete them. Expanding on our <a href="https://dvc.org/doc/user-guide/troubleshooting" target="_blank" rel="nofollow noopener noreferrer">troubleshooting</a> page would probably fit well as part of this project too.</p> <p><em>Difficulty rating:</em> Beginner-Medium<br><br></p> </li> <li> <p><strong>DVC 1.0 docs.</strong> We are soon to release DVC 1.0.0! This version brings some significant changes that for the first time in our <a href="https://github.com/iterative/dvc/releases" target="_blank" rel="nofollow noopener noreferrer">release history</a> are not completely backward-compatible. We expect that fully updating all our previous docs will take a few months, and you could help us with this! The main new features are listed below.</p> <blockquote> <p>UPDATE: See <a href="https://dvc.org/blog/dvc-3-years-and-1-0-release" target="_blank" rel="nofollow noopener noreferrer">post</a> about the release! And corresponding docs <a href="https://github.com/iterative/dvc.org/issues/1255" target="_blank" rel="nofollow noopener noreferrer">epic</a> task</p> </blockquote> <ul> <li>A <a href="https://github.com/iterative/dvc/issues/1871" target="_blank" rel="nofollow noopener noreferrer">multi-stage <em>pipelines file</em></a> that partially substitutes <a href="https://dvc.org/doc/user-guide/dvc-files" target="_blank" rel="nofollow noopener noreferrer">DVC files</a></li> <li>Separation between <a href="https://github.com/iterative/dvc/issues/3409" target="_blank" rel="nofollow noopener noreferrer">scalar vs. 
continuous metrics</a>, and new commands to visualize them, such as <a href="https://dvc.org/doc/command-reference/plots"><code>dvc plots</code></a></li> <li>A new <a href="https://github.com/iterative/dvc/issues/1234" target="_blank" rel="nofollow noopener noreferrer">run cache</a> that automatically saves experiment checkpoints between commits</li> </ul> <p><em>Difficulty rating:</em> Beginner-Medium<br><br></p> </li> <li> <p><strong>Video tutorials.</strong> Written documentation is great, but other media can also be important for our organization to reach a wide variety of learners. Expanding to video is also a core part of our developer advocacy strategy.</p> <p>One of DVC's priorities for this year is creating a library of video tutorials ranging from short explanations of basic DVC functions to more advanced use cases. You could assist in writing the scripts or even take the lead producing some videos, so image/video editing skills would come in handy (optional).</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/70a290d5570858cf3528fbe72f6070a9/39600/Discord_user_video_tutorials.png" alt="Discord user video tutorials" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Video tutorials are a common request by users in our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">chat</a>.</em></p> <p><strong>Mentor</strong>: <a href="https://github.com/elleobrien" target="_blank" rel="nofollow noopener noreferrer">Elle</a></p> <p><em>Difficulty rating:</em> Beginner-Medium<br><br></p> </li> <li> <p><strong>Interactive guides.</strong> Many of our docs include command line examples to illustrate how DVC works. In some cases these are full guides we want people to be able to follow by copying commands into their terminals. 
This has a few challenges: mainly keeping the rest of the document maintainable, brief, and easy to read; and supporting people on all platforms (Mac, Windows, Linux).</p> <p>So we started extracting some of the command examples into interactive <a href="https://www.katacoda.com/dvc" target="_blank" rel="nofollow noopener noreferrer">Katacoda scenarios</a> to match certain docs. However, they need maintenance and completion, and they still have to be embedded into the corresponding pages per <a href="https://github.com/iterative/dvc.org/issues/670" target="_blank" rel="nofollow noopener noreferrer">this issue</a>.</p> <p>This may involve working with our front-end team or, preferably, having some JavaScript coding experience.</p> <p><em>Difficulty rating:</em> Medium-Advanced</p> </li> <li> <p><strong>JavaScript engine UI/UX.</strong> Our website has custom <a href="https://github.com/iterative/dvc.org/tree/main/src" target="_blank" rel="nofollow noopener noreferrer">source code</a> we've developed over the years to host our landing pages, docs, and blog all in a high-performance, advanced static site (Node.js, Gatsby, React, TypeScript). 
We have several goals to further improve the usability and structure of our site, such as:</p> <ul> <li>Creating a <a href="https://github.com/iterative/dvc.org/issues/1073" target="_blank" rel="nofollow noopener noreferrer">special docs home page</a></li> <li>Improving <a href="https://github.com/iterative/dvc.org/issues/808" target="_blank" rel="nofollow noopener noreferrer">mobile menus</a></li> <li>Better navigation sidebar <a href="https://github.com/iterative/dvc.org/issues/753" target="_blank" rel="nofollow noopener noreferrer">highlighting</a> and <a href="https://github.com/iterative/dvc.org/issues/1198" target="_blank" rel="nofollow noopener noreferrer">positioning</a></li> <li>Other <a href="https://github.com/iterative/dvc.org/issues?q=is%3Aopen+is%3Aissue+label%3Adoc-engine" target="_blank" rel="nofollow noopener noreferrer">doc-engine</a> and <a href="https://github.com/iterative/dvc.org/issues?q=is%3Aopen+is%3Aissue+label%3Ablog-engine" target="_blank" rel="nofollow noopener noreferrer">blog-engine</a> issues</li> </ul> <p><em>Difficulty rating:</em> Medium-Advanced<br><br></p> </li> <li> <p><strong>SEO/ Site Analytics.</strong> Our current website analytics are somewhat basic. We will need to have a clear strategy to follow and improve our Search Engine results (with meta content, media optimization, <a href="https://github.com/iterative/dvc.org/issues?q=is%3Aissue+is%3Aopen+seo" target="_blank" rel="nofollow noopener noreferrer">etc.</a>), as well as to understand the behavior of our users to improve their experience. The specifics of the project are left for the applicant to suggest! This should be relatively simple for someone with proven experience in SEO or website QA.</p> <p>What tools should we employ? (e.g. Google Analytics, etc.) What trends and reports do we need to focus on? What kinds of users do we have and what interaction flows do they each follow? 
Can we semi-identify these users and/or cross-examine their data with DVC <a href="https://dvc.org/doc/user-guide/analytics" target="_blank" rel="nofollow noopener noreferrer">usage analytics</a>? Let's come up with a plan to answer these and other related questions!</p> <p><em>Difficulty rating:</em> Beginner-Medium<br><br></p> </li> </ol> <blockquote> <p>For more inspiration, feel free to review our <a href="https://github.com/iterative/dvc.org/labels/epic" target="_blank" rel="nofollow noopener noreferrer">epics</a> and other open docs <a href="https://github.com/iterative/dvc.org/issues?q=is%3Aopen+is%3Aissue+label%3Adoc-content+" target="_blank" rel="nofollow noopener noreferrer">issues</a>.</p> </blockquote> <h2 id="if-youd-like-to-apply" style="position:relative;">If you'd like to apply<a href="#if-youd-like-to-apply" aria-label="if youd like to apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Please refer to the <a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a> application guides for specifics of the program. Writers looking to know more about DVC, and our worldwide community of contributors, will learn most by visiting our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord chat</a>, <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub repository</a>, and <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Forum</a>. 
We are available to discuss project proposals from interested writers and can be reached by <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">email</a> or on Discord.</p>https://dvc.org/blog/april-20-community-gemshttps://dvc.org/blog/april-20-community-gemsThu, 16 Apr 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here are some Q&A's from our Discord channel that we think are worth sharing.</p> <h3 id="q-how-can-i-view-and-download-files-that-are-being-tracked-by-dvc-in-a-repository" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/698815826870009868" target="_blank" rel="nofollow noopener noreferrer">How can I view and download files that are being tracked by DVC in a repository?</a><a href="#q-how-can-i-view-and-download-files-that-are-being-tracked-by-dvc-in-a-repository" aria-label="q how can i view and download files that are being tracked by dvc in a repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To 
list the files that are currently being tracked in a project repository by DVC and Git, you can use <a href="https://dvc.org/doc/command-reference/list"><code>dvc list</code></a>. This will display the contents of that repository, including <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files. To download the contents corresponding to a particular <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file, use <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a>.</p> <p>Let's consider an example using both commands. Assume we're working with DVC's data registry example repository. To list the files present, run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc list</span> <span class="token parameter variable">-R</span> https://github.com/iterative/dataset-registry </span>.gitignore README.md get-started/.gitignore get-started/data.xml get-started/data.xml.dvc ...</code></pre></div> <p>Note the <code>-R</code> flag, which makes <a href="https://dvc.org/doc/command-reference/list"><code>dvc list</code></a> recursively display the contents of directories inside the repository. Now assume you want to download <code>data.xml</code>, which we can see is being tracked by DVC. 
To download the dataset to your local workspace, you would then run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> https://github.com/iterative/dataset-registry get-started/data.xml</span></code></pre></div> <p>For more examples and information, <a href="https://dvc.org/doc/command-reference/list#list" target="_blank" rel="nofollow noopener noreferrer">see the documentation</a> for <a href="https://dvc.org/doc/command-reference/list"><code>dvc list</code></a> and for <a href="https://dvc.org/doc/command-reference/get" target="_blank" rel="nofollow noopener noreferrer"><code>dvc get</code></a>.</p> <h3 id="q-im-setting-up-cloud-remote-storage-for-dvc-and-id-like-to-forbid-dvc-gc---cloud-so-users-cant-accidently-delete-files-in-the-remote-will-it-be-sufficient-to-restrict-deletion-in-the-remotes-settings" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/698116671298076672" target="_blank" rel="nofollow noopener noreferrer">I'm setting up cloud remote storage for DVC and I'd like to forbid <code>dvc gc --cloud</code> so users can't accidently delete files in the remote. 
Will it be sufficient to restrict deletion in the remote's settings?</a><a href="#q-im-setting-up-cloud-remote-storage-for-dvc-and-id-like-to-forbid-dvc-gc---cloud-so-users-cant-accidently-delete-files-in-the-remote-will-it-be-sufficient-to-restrict-deletion-in-the-remotes-settings" aria-label="q im setting up cloud remote storage for dvc and id like to forbid dvc gc cloud so users cant accidently delete files in the remote will it be sufficient to restrict deletion in the remotes settings permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You're right to be careful, because <a href="https://dvc.org/doc/command-reference/gc#--cloud"><code>dvc gc --cloud</code></a> can be dangerous in the wrong hands: it will remove any unused files in your remote (for more info, <a href="https://dvc.org/doc/command-reference/gc#gc" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>). To prevent users from having this power, setting your bucket policy to block object deletions should do the trick. How to do this will depend on your cloud storage provider. We found some relevant docs for <a href="https://cloud.google.com/iam/docs/understanding-roles#cloud_storage_roles" target="_blank" rel="nofollow noopener noreferrer">GCP</a>, <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/using-with-s3-actions.html" target="_blank" rel="nofollow noopener noreferrer">S3</a>, and <a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad" target="_blank" rel="nofollow noopener noreferrer">Azure</a>. 
For the full list of supported remote storage types, <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">see here</a>.</p> <h3 id="q-my-team-is-interested-in-dvc-and-we-have-all-of-our-data-in-remote-storage-do-we-need-to-install-a-centralised-enterprise-version-of-dvc-on-a-dedicated-server-and-do-we-have-to-also-have-a-github-repository" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/692524884701478992" target="_blank" rel="nofollow noopener noreferrer">My team is interested in DVC, and we have all of our data in remote storage. Do we need to install a centralised enterprise version of DVC on a dedicated server? And do we have to also have a GitHub repository?</a><a href="#q-my-team-is-interested-in-dvc-and-we-have-all-of-our-data-in-remote-storage-do-we-need-to-install-a-centralised-enterprise-version-of-dvc-on-a-dedicated-server-and-do-we-have-to-also-have-a-github-repository" aria-label="q my team is interested in dvc and we have all of our data in remote storage do we need to install a centralised enterprise version of dvc on a dedicated server and do we have to also have a github repository permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There's no need for a DVC server. 
Our remote storage works on top of <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">most kinds of cloud storage by default</a>, including S3, GCP, Azure, Google Drive, and Aliyun, with no additional infrastructure required. As for GitHub (or BitBucket, or GitLab, etc.), this is only needed if you're interested in sharing your project with others over that channel. We <em>like</em> sharing projects on GitHub, but you don't have to. Any Git repository, even a local one, will do.</p> <p>So a "minimal" DVC project for you might consist of a local workspace with Git enabled (which you <em>do</em> need), a local Git repository, and your S3 remote storage. Check out our <a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files" target="_blank" rel="nofollow noopener noreferrer">use cases</a> to see some examples of infrastructure and workflow for teams.</p> <h3 id="q-could-there-be-any-issues-with-concurrent-dvc-push-es-to-the-same-remote" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/680053750320332800" target="_blank" rel="nofollow noopener noreferrer">Could there be any issues with concurrent <code>dvc push</code>-es to the same remote?</a><a href="#q-could-there-be-any-issues-with-concurrent-dvc-push-es-to-the-same-remote" aria-label="q could there be any issues with concurrent dvc push es to the same remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are a few ways for concurrency to 
occur: multiple jobs running in parallel on the same machine, or different users on different machines. But in any case, the answer is the same: there's nothing to worry about! When pushing a file to a DVC remote, all operations are non-destructive and atomic.</p> <h3 id="q-how-do-i-only-download-part-of-my-remote-repository-for-example-i-only-need-the-final-output-of-my-pipeline-not-the-raw-data-or-intermediate-steps" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/696751934777852004" target="_blank" rel="nofollow noopener noreferrer">How do I only download part of my remote repository? For example, I only need the final output of my pipeline, not the raw data or intermediate steps.</a><a href="#q-how-do-i-only-download-part-of-my-remote-repository-for-example-i-only-need-the-final-output-of-my-pipeline-not-the-raw-data-or-intermediate-steps" aria-label="q how do i only download part of my remote repository for example i only need the final output of my pipeline not the raw data or intermediate steps permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We support granular operations on DVC project repositories! Say your project's DVC remote contains several <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files corresponding to different stages of your pipeline: <code>0_process_data.dvc</code>, <code>1_split_test_train.dvc</code>, and <code>2_train_model.dvc</code>. 
If you're only interested in the files output by the final stage of the pipeline (<code>2_train_model.dvc</code>), you can run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> 2_train_model.dvc</span></code></pre></div> <p>You can also use <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> at the level of individual files. This might be needed if your DVC pipeline file creates 10 outputs, for example, and you only want to pull one (say, <code>model.pkl</code>, your trained model) from remote DVC storage. You'd simply run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc pull</span> model.pkl</span></code></pre></div> <h3 id="q-how-can-i-remove-a-dvc-file-but-keep-the-associated-files-in-my-workspace" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/689827778358673469" target="_blank" rel="nofollow noopener noreferrer">How can I remove a <code>.dvc</code> file, but keep the associated files in my workspace?</a><a href="#q-how-can-i-remove-a-dvc-file-but-keep-the-associated-files-in-my-workspace" aria-label="q how can i remove a dvc file but keep the associated files in my workspace permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Sometimes, you realize you don't 
want to put a file under DVC tracking after all. That's okay, easy to fix. Simply remove the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file like any other file: <code>rm &lt;file&gt;.dvc</code>. DVC will then stop tracking the file, and the associated target file will still be in your local workspace. Note that the file will still be in your <a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">DVC cache</a> unless you clear it with <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>.</p> <h3 id="q-im-trying-to-move-a-stage-file-with-dvc-move-but-im-getting-an-error-whats-going-on" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/685125650901630996" target="_blank" rel="nofollow noopener noreferrer">I'm trying to move a stage file with <code>dvc move</code>, but I'm getting an error. What's going on?</a><a href="#q-im-trying-to-move-a-stage-file-with-dvc-move-but-im-getting-an-error-whats-going-on" aria-label="q im trying to move a stage file with dvc move but im getting an error whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The <a href="https://dvc.org/doc/command-reference/move"><code>dvc move</code></a> command is used to rename a file or directory and simultaneously modify its corresponding DVC file. 
It's handy so you don't rename a file in your local workspace that's under DVC tracking without updating DVC to the change (see an <a href="https://dvc.org/doc/command-reference/move#description" target="_blank" rel="nofollow noopener noreferrer">example here</a>). The function doesn't work on <a href="https://dvc.org/doc/tutorials/pipelines#define-stages" target="_blank" rel="nofollow noopener noreferrer">"stage files"</a> from DVC pipelines. There's not currently an easy way to safely move <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a> files, and it's an <a href="https://github.com/iterative/dvc/issues/1489" target="_blank" rel="nofollow noopener noreferrer">open issue we're working on</a>. Until then, you can manually update <a href="https://dvc.org/doc/user-guide/project-structure/dvcyaml-files"><code>dvc.yaml</code></a>, or make a new one in the desired location.</p> <h3 id="q-i-just-starting-using-dvc-and-noticed-that-when-i-dvc-push-files-to-remote-cloud-storage-the-directory-in-my-remote-looks-like-my-dvc-cache-not-my-local-workspace-directory-is-this-right" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/693740598498426930" target="_blank" rel="nofollow noopener noreferrer">I just starting using DVC and noticed that when I <code>dvc push</code> files to remote cloud storage, the directory in my remote looks like my DVC cache, not my local workspace directory. 
Is this right?</a><a href="#q-i-just-starting-using-dvc-and-noticed-that-when-i-dvc-push-files-to-remote-cloud-storage-the-directory-in-my-remote-looks-like-my-dvc-cache-not-my-local-workspace-directory-is-this-right" aria-label="q i just starting using dvc and noticed that when i dvc push files to remote cloud storage the directory in my remote looks like my dvc cache not my local workspace directory is this right permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yep, that's exactly how it should be! In order to provide deduplication and some other optimizations, your DVC remote's directory structure will mirror the DVC cache (which is by default in your local workspace under <code>.dvc/cache</code>). Effectively, DVC uses your Git repository to store DVC files, which are keys for cache files on your remote. So looking inside your remote won't be particularly enlightening if you're looking for human-readable filenames- the file names will look like hashes (because, well, they are). 
Luckily, DVC handles all the conversions between the filenames in your local workspace and these hashes.</p> <p>To get some more intuition about this, check out some of our <a href="https://dvc.org/doc/user-guide/dvc-internals" target="_blank" rel="nofollow noopener noreferrer">docs</a> about how DVC organizes files.</p>https://dvc.org/blog/april-20-dvc-heartbeathttps://dvc.org/blog/april-20-dvc-heartbeatMon, 06 Apr 2020 00:00:00 GMT<p>Welcome to the April Heartbeat, our <a href="https://dvc.org/blog/tags/heartbeat" target="_blank" rel="nofollow noopener noreferrer">monthly roundup of cool happenings</a>, good reads and other bright spots in our community.</p> <h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><strong>Adapting to the pandemic.</strong> Although the world seems different than when we posted last month, the DVC community is steady and strong. As a predominantly distributed company, we've been developing our infrastructure for remote work from the get-go. It isn't always <em>easy</em> to schedule an all-hands meeting across 9 time zones but we make it work. 
This experience has prepared us well for the COVID-19 pandemic: although there are new challenges (like caring for families while working from home) we've been able to weather the transition to fully remote work relatively well.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 605px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6203b39de7f66012048047cb492129ac/03346/laptop_on_boat.jpg" alt="laptop on boat" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Before social distancing started, DVC technical writer Jorge Orpinel Pérez worked from a canoe. Check out more photos from his workations <a href="https://www.instagram.com/workationer/" target="_blank" rel="nofollow noopener noreferrer">on Instagram</a>.</em></p> <p><strong>DVC sponsors DivOps.</strong> In a time when many conferences are going remote out of necessity, we were fortunate to be part of an <em>intentionally</em> remote conference this month! We sponsored <a href="https://divops.org/" target="_blank" rel="nofollow noopener noreferrer">DivOps</a>, a fully-online meeting led by women in DevOps. The DivOps lineup included speakers from GitHub, Dropbox, Gremlin and more. DVC data scientist Elle (that's me!) gave a ten-minute talk about MLOps and CI/CD, so <a href="https://dvc.org/blog/reimagining-devops-video" target="_blank" rel="nofollow noopener noreferrer">please check out the video</a>. 
Another very relevant talk was from Anna Petrovicheva, CEO of <a href="http://xperience.ai/" target="_blank" rel="nofollow noopener noreferrer">Xperience AI</a>; Anna <a href="https://youtu.be/8nwpCQufeE0" target="_blank" rel="nofollow noopener noreferrer">spoke about her team's development workflow for deep learning projects</a> and gave a clear overview of how they use DVC.</p> <p><strong>DVC on the airwaves.</strong> In early March, Elle was interviewed on an episode of <a href="https://www.interviewquery.com/tag/podcast/" target="_blank" rel="nofollow noopener noreferrer">The Data Stream podcast</a> about a DVC data science project, <a href="https://dvc.org/blog/a-public-reddit-dataset" target="_blank" rel="nofollow noopener noreferrer">building a public dataset of posts</a> from the "Am I the Asshole?" subreddit.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.interviewquery.com/blog-who-is-the-asshole/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">The Data Stream #3 - Who is the A-hole? With Elle</h4> <div class="elp-description">Ever wonder if it's possible to train a model to discover whether your friends are assholes or not? Today Elle comes on the show to talk about her project building a classifier to predict the results from reddit's hottest advice community: Am I the Asshole (or AITA for short).</div> <div class="elp-link">interviewquery.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-04-06/data_stream-1b5639c7a93df053471157dd03ff1852.png" alt="The Data Stream #3 - Who is the A-hole? 
With Elle"> </div> </a> </section> <p></p> <h2 id="new-releases" style="position:relative;">New releases<a href="#new-releases" aria-label="new releases permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This month, DVC has <a href="https://github.com/iterative/dvc/releases" target="_blank" rel="nofollow noopener noreferrer">released some new features</a> and updates:</p> <ul> <li>Did you know you can use Google Drive for remote storage with DVC? We've been hard at work delivering the best performance with Google Drive and are thrilled to invite users to try it out. Brand new <a href="https://dvc.org/doc/user-guide/setup-google-drive-remote#setup-a-google-drive-dvc-remote" target="_blank" rel="nofollow noopener noreferrer">docs</a> explain how to get started.</li> <li>We're introducing the <code>metrics diff</code> functionality, which lets you compare metrics from different commits side-by-side (<a href="https://dvc.org/doc/command-reference/metrics/diff" target="_blank" rel="nofollow noopener noreferrer">check out the docs</a> to learn more)</li> <li>Windows users, we are here for you. 
Contributor <a href="https://github.com/rxxg" target="_blank" rel="nofollow noopener noreferrer">rxxg</a> helped us get better performance on copy operations in Windows.</li> </ul> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><strong>DVC and R working together.</strong> One of our favorite blogs this month came from Marcel Ribeiro-Dantas, a developer and PhD student at the <a href="https://institut-curie.org/" target="_blank" rel="nofollow noopener noreferrer">Institut Curie</a>. Marcel wrote about using DVC to manage projects in R, particularly defining and versioning pipelines of data processing and analysis that can be reproduced easily. 
While DVC is language agnostic, much of our user content has been Python-centric, so it's exciting to see a detailed post for the R-using data scientist (for more about R with DVC, see <a href="https://dvc.org/blog/r-code-and-reproducible-model-development-with-dvc" target="_blank" rel="nofollow noopener noreferrer">Marija Ilić's post</a>)!</p> <p> </p><section class="elp-content-holder"> <a href="https://mribeirodantas.xyz/blog/index.php/2020/03/05/r-dvc-and-rmarkdown/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Manage your Data Science Project in R</h4> <div class="elp-description">A simple project tutorial with R/RMarkdown, Packrat, Git, and DVC.</div> <div class="elp-link">mribeirodantas.xyz</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-04-06/marcel-6cb50d09d344473e4dc1f4b30ceff3d1.jpeg" alt="Manage your Data Science Project in R"> </div> </a> </section> <p></p> <p>Also, Marcel recently gave an interview on <a href="https://medium.com/data-hackers/health-data-e-o-coronav%C3%ADrus-data-hackers-podcast-22-2b059d460cb1" target="_blank" rel="nofollow noopener noreferrer">The Data Hackers Podcast</a>, a Portuguese-language show. Listen for a shout-out about DVC!</p> <p><strong>DVC is in another book!</strong> Last month we reported that DVC is part of a Packt book, <a href="https://www.packtpub.com/programming/learn-python-by-building-data-science-applications" target="_blank" rel="nofollow noopener noreferrer">"Learn Python by Building Data Science Applications"</a>. 
This month, DVC got a mention in a just-released O'Reilly book, <a href="https://www.oreilly.com/library/view/building-machine-learning/9781492053187/" target="_blank" rel="nofollow noopener noreferrer">"Building Machine Learning Pipelines"</a> by Hannes Hapke and Catherine Nelson.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.oreilly.com/library/view/building-machine-learning/9781492053187/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Building Machine Learning Pipelines</h4> <div class="elp-description">Automating Model Life Cycles with TensorFlow</div> <div class="elp-link">oreilly.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-04-06/oreilly-966c0911a0b738fecd45cd73637ed540.jpeg" alt="Building Machine Learning Pipelines"> </div> </a> </section> <p></p> <p><strong>Some more links we like.</strong> Here are a few other discussions that have caught our attention.</p> <ul> <li> <p><strong>MLOps can be fun.</strong> Jeroen France's blog, "MLOps: Not as boring as it sounds!", reads like a "coming of age" story about embracing engineering as a data scientist. It's part-motivational, part tutorial- definitely worth a read. Here's a sample:</p> <blockquote> <p>No-one wants to baby-sit, maintain, and troubleshoot their own models once they are in production. Every data scientist secretly hopes they can pawn that job off to an engineering team, or maybe an intern, right? Well, in fact MLOps is going to make your data science life a lot better.</p> </blockquote> </li> <li> <p><strong>Leveling up your Jupyter notebooks.</strong> In a series called <a href="https://ljvmiranda921.github.io/notebook/2020/03/16/jupyter-notebooks-in-2020-part-2/" target="_blank" rel="nofollow noopener noreferrer">"How to Use Jupyter Notebooks in 2020"</a>, Lj Miranda discusses how to use Jupyter Notebooks in a mature software development workflow. 
He makes several recommendations for tools, including DVC.</p> </li> <li> <p><strong>Reddit discussion about CI/CD.</strong> When we shared our DivOps conference presentation on Reddit, some <a href="https://www.reddit.com/r/MachineLearning/comments/fshh9p/p_a_talk_about_adapting_cicd_systems_for_ml_full/" target="_blank" rel="nofollow noopener noreferrer">great discussion happened</a>. We chatted about how CI/CD might work for data scientists, who often begin a project with a phase of rapid exploration, and what version control for ML could look like without Git.</p> </li> <li> <p><strong>Smashing the data monolith.</strong> Engineer Juan López López wrote a blog called <a href="https://medium.com/packlinkeng/a-complete-guide-about-how-to-break-the-data-monolith-caa2ab2d01f6" target="_blank" rel="nofollow noopener noreferrer">"A complete guide about how to break the data monolith"</a>, which is a neat manifesto about treating infrastructure <em>and</em> data as code. It's got nice coverage of DVC, code examples, and some deeply enjoyable artwork.</p> </li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 527px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f6bccc332f6efbef0e58e9e349ad59ab/03346/monolith.jpg" alt="monolith" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>From Juan López López's <a href="https://medium.com/packlinkeng/a-complete-guide-about-how-to-break-the-data-monolith-caa2ab2d01f6" target="_blank" rel="nofollow noopener noreferrer">blog</a>.</em></p> <p>Thanks for reading. 
As always, let us know what you're making with DVC and what links are catching your interest in the blog comments, on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a>, and our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>. Be safe and be in touch!</p>https://dvc.org/blog/reimagining-devops-videohttps://dvc.org/blog/reimagining-devops-videoTue, 31 Mar 2020 00:00:00 GMT<p>Last week, DVC was part of <a href="https://divops.org/" target="_blank" rel="nofollow noopener noreferrer">DivOps</a>, a fully remote conference led by women in DevOps. DevOps, to the newly anointed, is a discipline bringing together strong software engineering practices with speedy development cycles. As machine learning is finding its way into just about <em>every</em> area of research and development, we're going to need to come up with some conventions and tools for integrating machine learning and big data with software development. This growing field is called <a href="https://towardsdatascience.com/the-rise-of-the-term-mlops-3b14d5bd1bdb" target="_blank" rel="nofollow noopener noreferrer">MLOps</a>.</p> <p>I gave a lightning talk about how we'll have to rethink our software development practices in the age of machine learning. It's got a focus on <a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">CI/CD</a>, a way of structuring workflows that we think can streamline exchanges between data scientists and software engineers. And, it's got fuzzy animals. 
Check it out here:</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/0MDrZpO_7Q4?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>If you liked this, you'll also want to check out the next talk in the DivOps playlist by <a href="https://www.linkedin.com/in/anna-petrovicheva-44b24673/" target="_blank" rel="nofollow noopener noreferrer">Anna Petrovicheva</a>, Founder and CEO of Xperience AI. Anna's talk goes deeper into developing best practices for software engineering with deep learning.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/8nwpCQufeE0?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>All the talks from DivOps are <a href="https://www.youtube.com/playlist?list=PLVeJCYrrCemgbA1cWYn3qzdgba20xJS8V" target="_blank" rel="nofollow noopener noreferrer">available online now</a>, so please check out the YouTube channel. 
And stay tuned on our blog for more CI/CD discussions coming soon…</p>https://dvc.org/blog/march-20-community-gemshttps://dvc.org/blog/march-20-community-gemsThu, 12 Mar 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here are some Q&A's from our Discord channel that we think are worth sharing.</p> <h3 id="q-i-have-several-simulations-organized-with-git-tags-i-know-i-can-compare-the-metrics-with-dvc-metrics-diff-a_rev-b_rev-substituting-hashes-branches-or-tags-for-a_rev-and-b_rev-but-what-if-i-wanted-to-see-the-metrics-for-a-list-of-tags" style="position:relative;">Q: I have several simulations organized with Git tags. I know I can compare the metrics with <a href="https://dvc.org/doc/command-reference/metrics/diff"><code>dvc metrics diff [a_rev] [b_rev]</code></a>, substituting hashes, branches, or tags for [a_rev] and [b_rev]. 
<a href="https://discordapp.com/channels/485586884165107732/563406153334128681/687634347104403528" target="_blank" rel="nofollow noopener noreferrer">But what if I wanted to see the metrics for a list of tags?</a><a href="#q-i-have-several-simulations-organized-with-git-tags-i-know-i-can-compare-the-metrics-with-dvc-metrics-diff-a_rev-b_rev-substituting-hashes-branches-or-tags-for-a_rev-and-b_rev-but-what-if-i-wanted-to-see-the-metrics-for-a-list-of-tags" aria-label="q i have several simulations organized with git tags i know i can compare the metrics with dvc metrics diff a_rev b_rev substituting hashes branches or tags for a_rev and b_rev but what if i wanted to see the metrics for a list of tags permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC has a built-in function for this! 
You can use <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> with the <code>-T</code> option:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics show</span> <span class="token parameter variable">-T</span></span></code></pre></div> <p>to list the metrics for all tagged experiments.</p> <p>Also, we have a couple of relevant discussions going on in our GitHub repo about <a href="https://github.com/iterative/dvc/issues/2799" target="_blank" rel="nofollow noopener noreferrer">handling experiments</a> and <a href="https://github.com/iterative/dvc/issues/3393" target="_blank" rel="nofollow noopener noreferrer">hyperparameter tuning</a>. Feel free to join the discussion and let us know what kind of support would help you most.</p> <h3 id="q-is-there-a-recommended-way-to-save-metadata-about-the-data-in-a-dvc-file-in-particular-id-like-to-save-summary-statistics-eg-mean-minimum-and-maximum-about-my-data" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/685105104340386037" target="_blank" rel="nofollow noopener noreferrer">Is there a recommended way to save metadata about the data in a <code>.dvc</code> file?</a> In particular, I'd like to save summary statistics (e.g., mean, minimum, and maximum) about my data.<a href="#q-is-there-a-recommended-way-to-save-metadata-about-the-data-in-a-dvc-file-in-particular-id-like-to-save-summary-statistics-eg-mean-minimum-and-maximum-about-my-data" aria-label="q is there a recommended way to save metadata about the data in a dvc file in particular id like to save summary statistics eg mean minimum and maximum about my data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 
3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>One simple way to keep metadata in a <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file is by using the <code>meta</code> field. Each <code>meta</code> entry is a <code>key:value</code> pair (for example, <code>name: Jean-Luc</code>). The <code>meta</code> field can be manually added or written programmatically, but note that if the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> file is overwritten (perhaps by <code>dvc run</code>, <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, or <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>) these values will not be preserved. You can read more about this <a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">in our docs</a>.</p> <p>Another approach would be to track the statistics of your dataset in a metric file, just as you might track performance metrics of a model. For a tutorial on using DVC metrics please <a href="https://dvc.org/doc/command-reference/metrics" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>.</p> <h3 id="q-my-team-has-been-using-dvc-in-production-when-we-upgraded-from-dvc-version-0710-we-started-getting-an-error-message-error-unexpected-error---my-folder-is-not-a-git-repository-whats-going-on" style="position:relative;">Q: My team has been using DVC in production. When we upgraded from DVC version 0.71.0, we started getting an error message: <code>ERROR: unexpected error - /my-folder is not a git repository</code>. 
<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/687403454989467650" target="_blank" rel="nofollow noopener noreferrer">What's going on?</a><a href="#q-my-team-has-been-using-dvc-in-production-when-we-upgraded-from-dvc-version-0710-we-started-getting-an-error-message-error-unexpected-error---my-folder-is-not-a-git-repository-whats-going-on" aria-label="q my team has been using dvc in production when we upgraded from dvc version 0710 we started getting an error message error unexpected error my folder is not a git repository whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is a consequence of new support we've added for monorepos with the <a href="https://dvc.org/doc/command-reference/init#--subdir"><code>dvc init --subdir</code></a> functionality (<a href="https://dvc.org/doc/command-reference/init#init" target="_blank" rel="nofollow noopener noreferrer">see more here</a>), which lets there be multiple DVC projects within a single Git repository. Now, if a DVC repository doesn't contain a <code>.git</code> directory, DVC expects the <code>no_scm</code> flag to be present in <code>.dvc/config</code> and raises an error if not. 
For example, one of our users reported this when using DVC to pull files into a Docker container that didn't have Git initialized (for more about using DVC without Git, <a href="https://dvc.org/doc/command-reference/init#initializing-dvc-without-git" target="_blank" rel="nofollow noopener noreferrer">see our docs</a>).</p> <p>You can fix this by running <a href="https://dvc.org/doc/command-reference/config"><code>dvc config core.no_scm true</code></a> (you could include this command in the script that creates Docker images). Alternately, you could include <code>.git</code> in your Docker container, but this is not advisable for all situations.</p> <p>We are currently working to <a href="https://github.com/iterative/dvc/issues/3474" target="_blank" rel="nofollow noopener noreferrer">add graceful error-handling</a> for this particular issue so stay tuned.</p> <h3 id="q-is-there-a-way-to-force-the-pipeline-to-rerun-even-if-its-dependencies-havent-changed" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/687422002822381609" target="_blank" rel="nofollow noopener noreferrer">Is there a way to force the pipeline to rerun, even if its dependencies haven't changed?</a><a href="#q-is-there-a-way-to-force-the-pipeline-to-rerun-even-if-its-dependencies-havent-changed" aria-label="q is there a way to force the pipeline to rerun even if its dependencies havent changed permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> has a 
flag that should help here. You can use the <code>-f</code> or <code>--force</code> flag to reproduce the pipeline even when no changes in the dependencies (for example, a training dataset tracked by DVC) have been found. So if you had a hypothetical DVC pipeline whose final process was <code>deploy.dvc</code>, you could run <a href="https://dvc.org/doc/command-reference/repro#-f"><code>dvc repro -f deploy.dvc</code></a> to rerun the whole pipeline.</p> <h3 id="q-whats-the-best-way-to-organize-dvc-repositories-if-i-have-several-training-datasets-shared-by-several-projects-some-projects-use-only-one-dataset-while-other-use-several-can-one-project-have-dvc-files-corresponding-to-different-remotes" style="position:relative;">Q: What's the best way to organize DVC repositories if I have several training datasets shared by several projects? Some projects use only one dataset while other use several. <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/670664813973864449" target="_blank" rel="nofollow noopener noreferrer">Can one project have <code>.dvc</code> files corresponding to different remotes?</a><a href="#q-whats-the-best-way-to-organize-dvc-repositories-if-i-have-several-training-datasets-shared-by-several-projects-some-projects-use-only-one-dataset-while-other-use-several-can-one-project-have-dvc-files-corresponding-to-different-remotes" aria-label="q whats the best way to organize dvc repositories if i have several training datasets shared by several projects some projects use only one dataset while other use several can one project have dvc files corresponding to different remotes permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 
1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, one project directory can contain datasets from several different DVC remotes. Specifically, DVC has functions <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> and <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> that emulate the experience of using a package manager for grabbing datasets from external sources. You can use <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> or <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> to access any number of datasets that are dependencies in a given project. For more on this, <a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">see our tutorial on data registries</a>.</p> <h3 id="q-what-are-the-risks-of-using-dvc-on-confidential-data" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/689848196473684024" target="_blank" rel="nofollow noopener noreferrer">What are the risks of using DVC on confidential data?</a><a href="#q-what-are-the-risks-of-using-dvc-on-confidential-data" aria-label="q what are the risks of using dvc on confidential data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC doesn't collect any information about your data (or code, or models, for that matter). 
You may have noticed that DVC <a href="https://dvc.org/doc/user-guide/analytics" target="_blank" rel="nofollow noopener noreferrer">collects Anonymized Usage Analytics</a>, which users may <a href="https://dvc.org/doc/user-guide/analytics#opting-out" target="_blank" rel="nofollow noopener noreferrer">opt out of</a>. The data we collect is extremely limited and anonymized, as it is collected mainly for the purpose of prioritizing bugs and feature development based on DVC usage. For example, we collect info about your operating system, DVC version, and installation method (the <a href="https://dvc.org/doc/user-guide/analytics#what" target="_blank" rel="nofollow noopener noreferrer">complete list of collected features is here</a>).</p> <p>Many of our users work with sensitive or private data, and we've developed DVC with such scenarios in mind from day one.</p> <h3 id="q-can-you-suggest-a-reference-architecture-for-using-dvc-as-part-of-mlops" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/683890642631524392" target="_blank" rel="nofollow noopener noreferrer">Can you suggest a reference architecture for using DVC as part of MLOps?</a><a href="#q-can-you-suggest-a-reference-architecture-for-using-dvc-as-part-of-mlops" aria-label="q can you suggest a reference architecture for using dvc as part of mlops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Increasingly, DVC is being used not just to version and manage machine learning projects, but as part of MLOps, <em>practices for combining data 
science and software engineering</em>. As MLOps is a fairly new discipline, standards and references aren't yet solidified. So while there isn't (<em>yet</em>) a standard recipe for using DVC in MLOps projects, we can point you to a few architectures we like, and which have been reported in sufficient detail to recreate.</p> <p>First, DVC can be used to detect events (such as dataset changes) in a CI/CD system that traditional version control systems might not be able to. An excellent and thorough <a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">blog by Danilo Sato et al.</a> explores using DVC in this way, as part of a CI/CD system that retrains a model automatically when changes in the dataset are detected.</p> <p>Second, DVC can be used to support model training on cloud GPUs, particularly as a tool for pushing and pulling files (such as datasets and trained models) between cloud computing instances, DVC repositories, and other environments. This architecture was the subject of a <a href="https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/" target="_blank" rel="nofollow noopener noreferrer">recent blog by Marcel Mikl and Bert Besser</a>. Their report describes the cloud computing setup and continuous integration pipeline quite well.</p> <p>If you develop your own architecture for using DVC in MLOps, please keep us posted. We'll be eager to learn from your experience. Also, keep an eye on our blog in the next few months. We're rolling out some new tools with a focus on MLOps!</p>https://dvc.org/blog/march-20-dvc-heartbeathttps://dvc.org/blog/march-20-dvc-heartbeatWed, 11 Mar 2020 00:00:00 GMT<p>Welcome to the March Heartbeat! 
Here are some highlights from our team and community this past month:</p> <h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><strong>DVC is STILL growing!</strong> In February, Senior Software Engineer <a href="https://www.linkedin.com/in/jiojiajiu/" target="_blank" rel="nofollow noopener noreferrer">Guro Bokum</a> joined DVC. He's previously contributed to the core DVC code base and brings several years of full-stack engineering expertise to the team. Welcome, Guro!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/40e5aae472d45aa14f9f6daee17ff183/39600/hi_guro.png" alt="hi guro" title="Imgx667" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Welcome, Guro!</em></p> <p><strong>New feature alert.</strong> We've received many requests for <a href="https://en.wikipedia.org/wiki/Monorepo" target="_blank" rel="nofollow noopener noreferrer">monorepo</a> support in DVC. As of DVC <a href="https://github.com/iterative/dvc/releases" target="_blank" rel="nofollow noopener noreferrer">release 0.87.0</a>, users can version data science projects within a monorepo! The new <a href="https://dvc.org/doc/command-reference/init#--subdir"><code>dvc init --subdir</code></a> functionality is designed to allow multiple DVC repositories within a single Git repository. 
Don't forget to upgrade and <a href="https://dvc.org/doc/command-reference/init" target="_blank" rel="nofollow noopener noreferrer">check out the latest docs</a>.</p> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>First, there's an intriguing <a href="https://github.com/iterative/dvc/issues/3393" target="_blank" rel="nofollow noopener noreferrer">discussion evolving in the DVC repo</a> about how machine learning hyperparameters (such as learning rate, number of layers in a deep neural network, etc.) can be tracked. Right now, hyperparameters are tracked as source code (i.e., with Git). Could we use some kind of abstraction to separate hyperparameters from source code in a DVC-managed project? Read on and feel free to jump into this discussion, largely helmed by software developer and DVC contributor <a href="http://elgehelge.github.io/" target="_blank" rel="nofollow noopener noreferrer">Helge Munk Jacobsen</a>.</p> <p>Another discussion we appreciated happened on Twitter:</p> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">We give tools like Slack and Zoom a lot of credit for making remote work possible, and I think Git and every hosted DVC system should equally get the same credit. Imagine life for a second without version control. 
Think about that.</p>— Celestine (@cyberomin) <a href="https://twitter.com/cyberomin/status/1223651811082559488?ref_src=twsrc%5Etfw">February 1, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> <p>Thanks, <a href="https://twitter.com/cyberomin" target="_blank" rel="nofollow noopener noreferrer">@cyberomin</a>!</p> <p>Elsewhere on the internet, DVC made the cut in a much-shared blog, <a href="https://medium.com/@squarecog/five-interesting-data-engineering-projects-48ffb9c9c501" target="_blank" rel="nofollow noopener noreferrer">Five Interesting Data Engineering Projects</a> by <a href="https://twitter.com/squarecog" target="_blank" rel="nofollow noopener noreferrer">Dmitry Ryaboy</a> (VP of Engineering at biotech startup Zymergen, and formerly Twitter). Dmitry wrote:</p> <blockquote> <p>To be honest, I’m a bit of a skeptic on “git for data” and various automated data / workflow versioning schemes: various approaches I’ve seen in the past were either too partial to be useful, or required too drastic a change in how data scientists worked to get a realistic chance at adoption. So I ignored, or even explicitly avoided, checking DVC out as the buzz grew. I’ve finally checked it out and… it looks like maybe this has legs? Metrics tied to branches / versions are a great feature. Tying the idea of git-like branches to training multiple models makes the value prop clear. The implementation, using Git for code and datafile index storage, while leveraging scalable data stores for data, and trying to reduce overall storage cost by being clever about reuse, looks sane. 
A lot of what they have to say in <a href="https://dvc.org/doc/understanding-dvc" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/doc/understanding-dvc</a> rings true.</p> </blockquote> <p>Check out the full blog here:</p> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/@squarecog/five-interesting-data-engineering-projects-48ffb9c9c501" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Five Interesting Data Engineering Projects</h4> <div class="elp-description">There’s been a lot of activity in the data engineering world lately, and a ton of really interesting projects and ideas have come on the scene in the past few years. This post is an introduction to (just) five that I think a data engineer who wants to stay current needs to know about.</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-03-11/dmitry_r-dbf24d69ef2c84c69371729b25b99e50.jpg" alt="Five Interesting Data Engineering Projects"> </div> </a> </section> <p></p> <p>One of the areas that DVC is growing into is continuous integration and continuous deployment (CI/CD), a part of the nascent field of MLOps. Naturally, we were thrilled to discover that CI/CD with DVC is taught in a new Packt book, <a href="https://www.packtpub.com/programming/learn-python-by-building-data-science-applications" target="_blank" rel="nofollow noopener noreferrer">"Learn Python by Building Data Science Applications"</a> by David Katz and Philipp Kats.</p> <p>In the authors' words, the goal of this book is to teach data scientists and engineers "not only how to implement Python in data science projects, but also how to maintain and design them to meet high programming standards." Needless to say, we are considering starting a book club. 
Grab a copy here:</p> <p> </p><section class="elp-content-holder"> <a href="https://www.packtpub.com/programming/learn-python-by-building-data-science-applications" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Learn Python by Building Data Science Applications</h4> <div class="elp-description">Understand the constructs of the Python programming language and use them to build data science projects</div> <div class="elp-link">packtpub.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-03-11/packt-d8568693d8daf46cb46c15e3d6f1103d.jpeg" alt="Learn Python by Building Data Science Applications"> </div> </a> </section> <p></p> <p>Last year in Mexico, DVC contributor Ramón Valles gave a talk about reproducible machine learning workflows at Data Day Monterrey—and <a href="https://www.youtube.com/watch?v=tAxG-n20Di4" target="_blank" rel="nofollow noopener noreferrer">a video of his presentation</a> is now online! In this Spanish-language talk, Ramón gives a thorough look at DVC, particularly building pipelines for reproducible ML.</p> <p> </p><section class="elp-content-holder"> <a href="https://www.youtube.com/watch?v=tAxG-n20Di4" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Experimentación ágil de machine learning con DVC</h4> <div class="elp-description">Data Day Monterrey '19</div> <div class="elp-link">youtube.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-03-11/dataday_mr-80082fa35e146d3cc5e7ff0afdcb8857.png" alt="Experimentación ágil de machine learning con DVC"> </div> </a> </section> <p></p> <p>Finally, DVC data scientist Elle (that's me!) 
released a new public dataset of posts from the Reddit forum <a href="https://reddit.com/r/amitheasshole" target="_blank" rel="nofollow noopener noreferrer">r/AmItheAsshole</a>, and reported some preliminary analyses. We're inviting anyone and everyone to play with the data, make some hypotheses and share their findings. Check it out here:</p> <p> </p><section class="elp-content-holder"> <a href="https://blog.dvc.org/a-public-reddit-dataset" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">AITA for making this? A public dataset of Reddit posts about moral dilemmas</h4> <div class="elp-description">Delve into an open natural language dataset of posts about moral dilemmas from r/AmItheAsshole. Use this dataset for whatever you want- here's how to get it and start playing.</div> <div class="elp-link">blog.dvc.org</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-03-11/aita_sm-bb0ee7157daa11e5246496403d0f6e16.png" alt="AITA for making this? 
A public dataset of Reddit posts about moral dilemmas"> </div> </a> </section> <p></p> <p>That's all for now—thanks for reading, and be in touch on our <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>, <a href="https://twitter.com/dvcorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a>, and <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>.</p>https://dvc.org/blog/february-20-community-gemshttps://dvc.org/blog/february-20-community-gemsWed, 19 Feb 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Welcome to the February roundup of useful, intriguing, and good-to-know discussions going on with DVC users and developers. 
Let's dive right in with some questions from our Discord channel.</p> <h3 id="q-if-i-have-multiple-outputs-from-a-dvc-pipeline-and-only-want-to-checkout-one-what-command-would-i-run" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/670233820326264843" target="_blank" rel="nofollow noopener noreferrer">If I have multiple outputs from a DVC pipeline and only want to checkout one, what command would I run?</a><a href="#q-if-i-have-multiple-outputs-from-a-dvc-pipeline-and-only-want-to-checkout-one-what-command-would-i-run" aria-label="q if i have multiple outputs from a dvc pipeline and only want to checkout one what command would i run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>By default, <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> is written for a <a href="https://dvc.org/doc/command-reference/checkout" target="_blank" rel="nofollow noopener noreferrer">Git-like experience</a>, meaning that it will sync your local workspace with all the model files, dependencies, and outputs specified by a project's <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> files. If you only want to access one artifact from the project, you can do this with <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout &lt;path to file&gt;</code></a>. 
This will deliver the specified file to your workspace.</p> <p>If you're interested in sharing specific artifacts (like data files or model binaries) with other users, you might also consider <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> and <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>. These functions are ideal for downloading a single file (or a few files) to the local workspace, instead of the whole project.</p> <h3 id="q-i-have-a-complicated-use-case-were-trying-to-set-up-a-system-where-users-act-as-data-scientists-theyd-select-data-which-would-be-cleanedtransformed-in-the-backend-and-experiment-with-model-hyperparameters-until-theyre-happy-with-the-model-result-then-they-can-save-the-model-including-artifacts-like-the-input-data-used-metrics-and-binary-model-file-placing-the-experiment-under-version-control-later-they-can-load-the-model-again-and-select-new-input-data-from-our-database-change-parameters-and-update-it-there-might-be-hundreds-of-separate-models-can-dvc-do-this" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/668773484549242890" target="_blank" rel="nofollow noopener noreferrer">I have a complicated use case.</a> We're trying to set up a system where users act as data scientists. They'd select data, which would be cleaned/transformed in the backend, and experiment with model hyperparameters until they're happy with the model result. Then they can "save" the model, including artifacts like the input data used, metrics, and binary model file, placing the experiment under version control. Later they can "load" the model again and select new input data from our database, change parameters, and "update it". There might be hundreds of separate models. 
Can DVC do this?<a href="#q-i-have-a-complicated-use-case-were-trying-to-set-up-a-system-where-users-act-as-data-scientists-theyd-select-data-which-would-be-cleanedtransformed-in-the-backend-and-experiment-with-model-hyperparameters-until-theyre-happy-with-the-model-result-then-they-can-save-the-model-including-artifacts-like-the-input-data-used-metrics-and-binary-model-file-placing-the-experiment-under-version-control-later-they-can-load-the-model-again-and-select-new-input-data-from-our-database-change-parameters-and-update-it-there-might-be-hundreds-of-separate-models-can-dvc-do-this" aria-label="q i have a complicated use case were trying to set up a system where users act as data scientists theyd select data which would be cleanedtransformed in the backend and experiment with model hyperparameters until theyre happy with the model result then they can save the model including artifacts like the input data used metrics and binary model file placing the experiment under version control later they can load the model again and select new input data from our database change parameters and update it there might be hundreds of separate models can dvc do this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Most of this functionality is supported by DVC already. 
We recommend <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> as a method for giving users access to data in a repository (and also check out our <a href="https://dvc.org/doc/use-cases/data-registries" target="_blank" rel="nofollow noopener noreferrer">tutorial on data registries</a>). For pre-processing data, <a href="https://dvc.org/doc/get-started/pipeline" target="_blank" rel="nofollow noopener noreferrer">DVC pipelines</a> can automate a procedure for transforming and cleaning inputs (i.e., you can use bash scripts to <code>dvc run</code> the pipeline whenever a user selects a dataset). Saving the workspace after experimentation, including model files, metrics, and outputs, is a core function of DVC (see the <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> functions). We also have a <a href="https://dvc.org/doc/use-cases/data-registries#programatic-reusability-of-dvc-data" target="_blank" rel="nofollow noopener noreferrer">Python API</a> so users can load artifacts like datasets and model files into their local Python session. When they're done experimenting, they can <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> and <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> their progress. Users can later "pull" a saved workspace and all associated files using <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a>.</p> <p>As for how to organize hundreds of separate experiments, we're still evolving our strategy and best-practice recommendations. It's conceivable that each experiment could be carried out and saved on a separate branch of a project repository. 
Our thoughts about structuring version control around architecture search and hyperparameter tuning could fill up a whole blog (and probably will in the not-so-distant future); check out one of our <a href="https://github.com/iterative/dvc/issues/2799" target="_blank" rel="nofollow noopener noreferrer">recent conversation threads</a> if you'd like to see where we're currently at. And please let us know how your use case goes—at this stage, we'd love to hear what works for you.</p> <h3 id="q-whats-the-difference-between-config-and-configlocal-files-is-it-safe-to-do-git-commit-without-including-my-config-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/666708671333400599" target="_blank" rel="nofollow noopener noreferrer">What's the difference</a> between <code>config</code> and <code>config.local</code> files? Is it safe to do git commit without including my config file?<a href="#q-whats-the-difference-between-config-and-configlocal-files-is-it-safe-to-do-git-commit-without-including-my-config-file" aria-label="q whats the difference between config and configlocal files is it safe to do git commit without including my config file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are indeed two kinds of config files you might come across in your project directory's <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> folder and <code>.gitignore</code> file. 
The key difference is that <code>config</code> is intended to be committed to Git, while <code>config.local</code> is not. You'd use <code>config.local</code> to store sensitive information (like personal credentials for SSH or another kind of authenticated storage) or settings specific to your local environment—things you wouldn't want to push to a GitHub repo. DVC only modifies <code>config.local</code> when you explicitly use the <code>--local</code> flag in the <a href="https://dvc.org/doc/command-reference/config"><code>dvc config</code></a> or <a href="https://dvc.org/doc/command-reference/remote"><code>dvc remote *</code></a> commands, so outside of these cases you shouldn't have to worry about it.</p> <p>As for using <code>git commit</code> without the <code>config</code> file, it is safe. <em>But</em> you should check if there are any settings in <code>config.local</code> that you actually want to save to <code>config</code>. This would be rare, since as we mentioned, you'd only have settings in <code>config.local</code> if you expressly called for them with the <code>--local</code> flag.</p> <h3 id="q-i-have-an-azure-storage-account-container-and-the-only-link-i-can-see-in-my-azure-portal-for-the-container-is-an-http-link-but-the-tutorial-on-dvc-shows-azure-storage-accessed-with-the-azure-protocol-which-is-right" style="position:relative;">Q: I have an Azure storage account container, and the only link I can see in my Azure portal for the container is an <code>http://</code> link. But the tutorial on DVC shows Azure storage accessed with the <code>azure://</code> protocol. 
<a href="https://discordapp.com/channels/485586884165107732/563406153334128681/675087897661276169" target="_blank" rel="nofollow noopener noreferrer">Which is right?</a><a href="#q-i-have-an-azure-storage-account-container-and-the-only-link-i-can-see-in-my-azure-portal-for-the-container-is-an-http-link-but-the-tutorial-on-dvc-shows-azure-storage-accessed-with-the-azure-protocol-which-is-right" aria-label="q i have an azure storage account container and the only link i can see in my azure portal for the container is an http link but the tutorial on dvc shows azure storage accessed with the azure protocol which is right permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>What you're describing is exactly as it should be. <code>azure://</code> is an internal URL protocol that tells DVC which API to use to connect to your remote storage, not the exact address of your Blob. You can use the format <code>azure://&lt;container-name&gt;/&lt;optional-path&gt;</code>. 
For more details, you can refer to our documentation about <a href="https://dvc.org/doc/command-reference/remote/add#supported-storage-types" target="_blank" rel="nofollow noopener noreferrer">supported storage types</a>.</p> <h3 id="q-im-using-dvc-to-version-my-data-with-google-drive-storage-if-i-want-a-developer-to-be-able-to-download-the-data-can-i-give-them-my-gdrive_client_id-and-gdrive_client_secret-or-maybe-give-them-permission-to-access-my-google-drive-folder" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/667198775361536019" target="_blank" rel="nofollow noopener noreferrer">I'm using DVC to version my data with Google Drive storage.</a> If I want a developer to be able to download the data, can I give them my <code>gdrive_client_id</code> and <code>gdrive_client_secret</code>, or maybe give them permission to access my Google Drive folder?<a href="#q-im-using-dvc-to-version-my-data-with-google-drive-storage-if-i-want-a-developer-to-be-able-to-download-the-data-can-i-give-them-my-gdrive_client_id-and-gdrive_client_secret-or-maybe-give-them-permission-to-access-my-google-drive-folder" aria-label="q im using dvc to version my data with google drive storage if i want a developer to be able to download the data can i give them my gdrive_client_id and gdrive_client_secret or maybe give them permission to access my google drive folder permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>For Google Drive, <code>gdrive_client_id</code> and <code>gdrive_client_secret</code> aren't used to access 
a specific user's Google Drive disk; they're predominantly used by Google's API to <a href="https://rclone.org/drive/#making-your-own-client-id" target="_blank" rel="nofollow noopener noreferrer">track usage and set appropriate rate limits</a>. So the risk in sharing them is not that your personal files will be vulnerable, but that your API usage limits could be negatively affected if others are using it with your credentials. Whether this risk is acceptable is up to you. It's not unusual for teams and organizations to share a set of credentials, so a reasonable level of security may mean ensuring that the <code>config</code> file for your project (which typically contains Google Drive credentials) is only visible to team members.</p> <p>Please check out our <a href="https://dvc.org/doc/user-guide/setup-google-drive-remote" target="_blank" rel="nofollow noopener noreferrer">docs about Google Drive</a>, too, for more about how DVC uses the Google Drive API.</p> <h3 id="q-i-just-tried-to-upgrade-dvc-via-homebrew-and-got-a-sha256-mismatch-error-whats-going-on" style="position:relative;">Q: I just tried to upgrade DVC via <code>homebrew</code> and got a "SHA256 mismatch" error. 
<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/672930535261339669" target="_blank" rel="nofollow noopener noreferrer">What's going on</a>?<a href="#q-i-just-tried-to-upgrade-dvc-via-homebrew-and-got-a-sha256-mismatch-error-whats-going-on" aria-label="q i just tried to upgrade dvc via homebrew and got a sha256 mismatch error whats going on permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>What most likely happened is that you first installed DVC via <code>brew install iterative/homebrew-dvc/dvc</code>, which is no longer supported—because DVC is now a core Homebrew formula! 
Please uninstall and reinstall using <code>brew install dvc</code> for uninterrupted upgrades in the future.</p> <h3 id="q-i-still-cant-convince-myself-to-version-control-the-data-rather-than-meta-data-can-anyone-give-me-a-strong-argument-against-version-controlling-data-file-paths-in-config-files-instead-of-using-dvc" style="position:relative;">Q: <a href="https://www.reddit.com/r/datascience/comments/aqkg59/does_anyone_use_data_version_control_dvc_thoughts/eq62lkt?utm_source=share&utm_medium=web2x" target="_blank" rel="nofollow noopener noreferrer">I still can't convince myself to version-control the data rather than meta-data.</a> Can anyone give me a strong argument against version controlling data file paths in config files instead of using DVC?<a href="#q-i-still-cant-convince-myself-to-version-control-the-data-rather-than-meta-data-can-anyone-give-me-a-strong-argument-against-version-controlling-data-file-paths-in-config-files-instead-of-using-dvc" aria-label="q i still cant convince myself to version control the data rather than meta data can anyone give me a strong argument against version controlling data file paths in config files instead of using dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><em>This question is from a <a href="https://bit.ly/38HOEcj" target="_blank" rel="nofollow noopener noreferrer">Reddit discussion.</a></em></p> <p>Versioning the meta-data associated with your dataset is certainly a workable strategy. 
You can use prefixes and suffixes to distinguish models trained on different versions of data, and keep your data files in one <code>.gitignored</code> directory. That may be enough for some projects. In our experience, though, we've found this comes with a host of complications that don't scale well:</p> <ol> <li>You'll have to write custom code to support this configuration, specifying filepaths to your dataset with hardcoded links.</li> <li>For files that are outputs of your analysis pipeline, you'll need to agree on naming conventions for suffixes/prefixes that specify which version of the dataset was used.</li> <li>Depending on the meta-data you use to version data files, you may not detect changes made by users. Even if you can tell a change has occurred, you may not be able to track <em>who</em> did it <em>when</em>.</li> </ol> <p>We designed DVC to optimize data management from the user's perspective: users can change the dataset version without changing their code, so organizations don't have to adhere to explicit filenaming conventions and hardcoded links that are prone to human error. Furthermore, versioning data the way Git versions code provides a largely immutable record of every change that has occurred. We think this is important as teams and projects grow in complexity. And from a systems-level perspective, DVC does more than track data: it deduplicates files behind the scenes, provides simple interfaces for sharing datasets (and models!) with collaborators and users, and connects specific model files with the dataset versions they were trained on.</p> <p>To summarize, DVC is not the only way to version your data.
But we think it's one way to reduce the overhead of managing data infrastructure when your project involves experimentation or collaboration.</p>https://dvc.org/blog/a-public-reddit-datasethttps://dvc.org/blog/a-public-reddit-datasetMon, 17 Feb 2020 00:00:00 GMT<p>In data science, we frequently deal with classification problems like, <em>is this <a href="https://www.ics.uci.edu/~vpsaini/" target="_blank" rel="nofollow noopener noreferrer">Yelp reviewer unhappy</a> with their brunch? Is <a href="https://archive.ics.uci.edu/ml/datasets/spambase" target="_blank" rel="nofollow noopener noreferrer">this email</a> begging me to claim my long-lost inheritance spam? Does this <a href="http://ai.stanford.edu/~amaas/data/sentiment/" target="_blank" rel="nofollow noopener noreferrer">movie critic</a> have a positive opinion of Cats?</em></p> <p>Perhaps we should also consider the fundamental introspective matter of, <em>am I maybe being a bit of an asshole?</em></p> <p>I want to share a dataset of collected moral dilemmas shared on Reddit, as well as the judgments handed down by a jury of Redditors. The wellspring of this data is the <a href="https://www.reddit.com/r/AmItheAsshole/" target="_blank" rel="nofollow noopener noreferrer">r/AmItheAsshole</a> subreddit, one of the natural wonders of the digital world. 
In this article, I'll show you what's in the dataset, how to get it, and some things you can do to move the frontiers of Asshole research forward.</p> <h2 id="what-makes-an-asshole" style="position:relative;">What makes an Asshole?<a href="#what-makes-an-asshole" aria-label="what makes an asshole permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>r/AmItheAsshole is a semi-structured online forum that’s the internet’s closest approximation of a judicial system. In this corner of the web, citizens post situations from their lives and Redditors vote to decide if the writer has acted as The Asshole or not. For example:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b140849336e06dc98fe3b111add8224e/39600/aita_sample.png" alt="aita sample" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Without bringing any code into the picture, it’s intuitive to think of each new post as a classification task for the subreddit. 
Formally, we could think of the subreddit as executing a function <em>f</em> such that</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 500px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9b43c96909c892a85245ab99f863766e/39600/aita_formula.png" alt="aita formula" title="aita formula" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Of course, finding f won’t be trivial. To be frank, I’m not positive how well we could hope to forecast the rulings of the subreddit. A lot of posts are not easy for me to decide- like,</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 680px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/42054fa4bae8f5578a79e5da28bd5181/39600/aita_llama.png" alt="aita llama" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>There are also many times I find myself disagreeing with the subreddit’s verdict. All this is to say, I don’t think it’s obvious how well a given human would do on the task of predicting whether Redditors find someone an Asshole. Nor is it clear how well we could ever hope for a machine to do approximating their judgment.</p> <p>It seems fun to try, though. 
It helps that the data is plentiful: because the subreddit is popular and well-moderated, there’s an especially strong volume of high-quality content (i.e., on-topic and appropriately formatted) being posted daily.</p> <h2 id="building-the-dataset" style="position:relative;">Building the dataset<a href="#building-the-dataset" aria-label="building the dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>I pulled content from r/AmItheAsshole dating from the first post in 2012 to January 1, 2020 using the <a href="https://pushshift.io/" target="_blank" rel="nofollow noopener noreferrer">pushshift.io</a> API to get post IDs and <a href="https://www.reddit.com/wiki/faq#wiki_how_is_a_submission.27s_score_determined.3F" target="_blank" rel="nofollow noopener noreferrer">scores</a>, followed by Reddit’s API (<a href="https://praw.readthedocs.io/en/latest/" target="_blank" rel="nofollow noopener noreferrer">praw</a>) to get post content and meta-data. Using a <a href="https://openai.com/blog/better-language-models/" target="_blank" rel="nofollow noopener noreferrer">standard similar to OpenAI’s</a> for trawling Reddit, I collected text only from posts with scores of 3 or more, as a form of quality control. This cut the number of posts from ~355K to ~111K. Each data point contains an official ID code, timestamp, post title, post text, verdict, score, and comment count; usernames are not included. The scraping and cleaning code is available <a href="https://github.com/iterative/aita_dataset" target="_blank" rel="nofollow noopener noreferrer">in the project GitHub repo</a>.
For simplicity on the first iteration of this problem, I didn’t scrape post comments, which can number in the thousands for popular posts. But, should sufficient interest arise, I’d consider adding them to the dataset in some form.</p> <p>To focus on the task of classifying posts, I did some light cleaning: I removed posts in which the body of the text was redacted (surprisingly common) or blank, and attempted to remove edits where the author had clearly given away the verdict (e.g., an edit that says, “Update: You’re right, I was the asshole.”). There were also verdicts that only occurred once (“cheap asshole”, “Crouching Liar; hidden asshole”, “the pizza is the asshole”), so I restricted the dataset to posts with standard verdicts. This left ~63K points. Below is a sample of the resulting dataframe:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4a12da0214084c297acabff7878e4852/39600/df_sample.png" alt="df sample" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Click to enlarge.</em></p> <p>The dataset is a snapshot of the subreddit in its current state, but the subreddit is certain to change over time as new content gets added. 
In the interest of having the most comprehensive dataset about being an asshole ever collected, <em>I’m planning to update this dataset monthly with new posts.</em></p> <h2 id="how-to-get-the-dataset" style="position:relative;">How to get the dataset<a href="#how-to-get-the-dataset" aria-label="how to get the dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Since this dataset will be updated regularly, we’re using Git and DVC to package, version, and release it. The data itself is stored in an S3 bucket, and you can use DVC to import the data to your workspace. If you haven't already, you'll need to <a href="https://dvc.org/doc/install" target="_blank" rel="nofollow noopener noreferrer">install DVC</a>; one of the simplest ways is <code>pip install dvc</code>.</p> <p>Say you have a directory on your local machine where you plan to build some analysis scripts. Simply run</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> https://github.com/iterative/aita_dataset <span class="token punctuation">\</span>
  aita_clean.csv</span></code></pre></div> <p>This will download a .csv dataset into your local directory, corresponding to the cleaned version.
If you wanted the raw dataset, you would substitute <code>aita_raw.csv</code> for <code>aita_clean.csv</code>.</p> <p>Because the dataset is >100 MB, I’ve created a git branch (called “lightweight”) with 10,000 randomly sampled (cleaned) data points for quick-and-dirty experimentation that won’t occupy all your laptop’s memory. To download only this smaller dataset, run</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token parameter variable">--rev</span> lightweight <span class="token punctuation">\</span>
  https://github.com/iterative/aita_dataset <span class="token punctuation">\</span>
  aita_clean.csv</span></code></pre></div> <h2 id="a-quick-look-at-the-data" style="position:relative;">A quick look at the data<a href="#a-quick-look-at-the-data" aria-label="a quick look at the data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Let’s take a flyover look at the dataset so far. The code to make the following visuals and results is <a href="https://github.com/andronovhopf/aita_viz_and_classify" target="_blank" rel="nofollow noopener noreferrer">available on GitHub</a>. First, here’s a frequency plot for how common different verdicts are on the subreddit.
In addition to “Asshole” and “Not the Asshole”, there are two other rulings: “Everyone Sucks” and “No Assholes Here”.</p> <p><img src="https://dvc.org/2020-02-17/freq_plot-88ad442f08b13a57408f75dd5b54bd63.svg" alt=""></p> <p>In general agreement with an <a href="http://www.nathancunn.com/2019-04-04-am-i-the-asshole/" target="_blank" rel="nofollow noopener noreferrer">analysis by Nathan Cunn</a>, the majority of posts are deemed “Not the Asshole” or “No Assholes Here”. If you are posting on r/AmItheAsshole, you are probably not the asshole.</p> <p>Next, I attempted a very basic classifier, logistic regression using 1-gram frequencies (i.e., the frequency of word occurrences in post titles and bodies) as features. This is intended to give a baseline for what kind of performance any future modeling efforts should beat. Because of the strong class imbalance, I used <a href="https://imbalanced-learn.org/stable/over_sampling.html#smote-variants" target="_blank" rel="nofollow noopener noreferrer">SMOTE to oversample</a> Asshole posts. And, for simplicity, I binarized the category labels:</p> <table><thead><tr><th align="center">Verdict</th><th align="center">Label</th></tr></thead><tbody><tr><td align="center">Asshole</td><td align="center">1</td></tr><tr><td align="center">Everyone Sucks</td><td align="center">1</td></tr><tr><td align="center">Not the Asshole</td><td align="center">0</td></tr><tr><td align="center">No Assholes Here</td><td align="center">0</td></tr></tbody></table> <p>With 5-fold cross-validation, this classifier performed above-chance but modestly: accuracy was 62.0% +/- 0.005 (95% confidence interval). Curiously, the only other classifier attempt I could find online <a href="https://github.com/amr-amr/am-i-the-asshole" target="_blank" rel="nofollow noopener noreferrer">reported 61% accuracy on held-out data</a> using the much more powerful BERT architecture.
Considering that logistic regression has zero hidden layers, and our features discard sequential information entirely, we’re doing quite well! Although I can’t be certain, I’m curious how much the discrepancy comes down to dataset size: the previous effort with BERT appears to be trained on ~30K posts.</p> <p>Seeing that logistic regression on word counts doesn’t produce total garbage, I looked at which words were predictive of class using the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html" target="_blank" rel="nofollow noopener noreferrer">chi-squared test</a>. The top five informative words were mom, wife, mother, edit, and dad (looks like Assholes go back to edit their posts). Since familial relationships featured prominently, I <a href="https://www.tidytextmining.com/twitter.html#comparing-word-usage" target="_blank" rel="nofollow noopener noreferrer">estimated the log odds ratio</a> of being voted Asshole (versus Not the Asshole) if your post mentions a mom, dad, girlfriend/wife or boyfriend/husband. Roughly, the log odds ratio represents the difference in probability of a keyword occurring in Asshole posts compared to Not-Asshole posts.</p> <p><img src="https://dvc.org/2020-02-17/svg_kw2-b03427e1361dcff2e52b80a34525e4a2.svg" alt=""></p> <p>For reference, the log odds ratios are computed with base 2; a score of 1 means that Asshole posts are twice as likely to contain the keyword as Not the Asshole posts. So keep in mind that the effect sizes we’re detecting, although almost certainly non-zero, are still fairly small.</p> <p>There seems to be a slight anti-parent trend, with Redditors being more likely to absolve authors who mention a mom or dad. Only mentioning a female romantic partner (wife/girlfriend) was associated with a greater likelihood of being voted the Asshole. This surprised me.
My unsubstantiated guess about the gender difference in mentioning romantic partners is that women may be particularly likely to question themselves when they act assertively in a relationship. If this were the case, we might find an especially high proportion of uncontroversial “Not the Asshole” posts from heterosexual women asking about situations with their male partners.</p> <h2 id="how-to-get-more-data" style="position:relative;">How to get more data<a href="#how-to-get-more-data" aria-label="how to get more data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As I said earlier, the plan is to grow the dataset over time. I’ve just run a new scrape for posts from January 1-31, 2020 and am adding them to the public dataset now. To check for a new release, you can re-run the <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> command you used to grab the dataset.</p> <p>If you’re serious about taking on a project such as, say, building a classifier that beats our state of the art, word-count-based, logistic regression model, I’d like to recommend a better way to integrate the dataset into your workflow: <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>. <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> is like <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a>, but it preserves a link to the hosted data set. 
This is desirable if you might iterate through several experiments in the search for the right architecture, for example, or think you’ll want to re-train a model. To get the dataset the first time, you’ll run:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span>
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://github.com/iterative/aita_dataset <span class="token punctuation">\</span>
  aita_clean.csv</span></code></pre></div> <p>Then, because the dataset in your workspace is linked to our dataset repository, you can update it by simply running:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc update</span> aita_clean.csv</span></code></pre></div> <p>An additional benefit of codifying the link between your copy of the dataset and ours is that you can track the form of the dataset you used at different points in your project development. You can jump back and forth through the project history then, not only to previous versions of code but also to versions of (specifically, links to) data.
For example, you could roll back the state of the project to before you updated the dataset and re-run your classifier:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">git</span> log <span class="token parameter variable">--oneline</span>
</span>58e28a5 retrain logistic reg
6a44161 update aita dataset
0de4fc3 try logistic regression classifier
a266f15 get aita dataset
55031b0 first commit
<span class="token line"><span class="token input">$ </span><span class="token git">git checkout</span> 0de4fc3
</span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc checkout</span>
</span><span class="token line"><span class="token input">$ </span><span class="token command">python</span> train_classifier.py</span></code></pre></div> <p>Oh, and one more note: you can always use <a href="https://dvc.org/doc/command-reference/get"><code>dvc get</code></a> and <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> to grab an older version of the dataset using the tags associated with each release.
The current release is v.20.1 and the original release is v.20.0- the numeric codes correspond to the year and month.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc get</span> <span class="token parameter variable">--rev</span> v.20.0 <span class="token punctuation">\</span>
  https://github.com/iterative/aita_dataset aita_clean.csv</span></code></pre></div> <h2 id="whats-next" style="position:relative;">What’s next<a href="#whats-next" aria-label="whats next permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>I hope that sharing this evolving dataset invites some curiosity, because a lot of questions come to mind:</p> <ol> <li>Can you beat our classifier that predicts how the subreddit will rule?</li> <li>Is verdict even the most interesting outcome to predict? For example, developer Scott Ratigan <a href="https://github.com/scotteratigan/amitheahole" target="_blank" rel="nofollow noopener noreferrer">created a tool to estimate weighted scores</a> for each post based on the comments (e.g., 75% Asshole, 25% Not the Asshole). What metrics might invite deeper questions?</li> <li>Can you identify sentences or phrases that are most informative about the verdict Redditors reach?</li> <li>Do voting patterns systematically differ by topic of discussion?</li> <li>How reliable are verdicts?
When a very similar situation is posted multiple times, do Redditors usually vote the same way?</li> <li>Is the subreddit’s posting and voting behavior changing over time?</li> <li>Can you formulate any testable hypotheses based on <a href="https://www.reddit.com/r/AmItheAsshole/comments/dcae07/2019_subscriber_survey_data_dump/?" target="_blank" rel="nofollow noopener noreferrer">this survey of the subreddit’s demographics</a>?</li> <li>How often do non-Redditors agree with the subreddit? Under what circumstances might they tend to disagree?</li> </ol> <p>I expect that leaning into the particulars of the dataset- thinking about how the format influences the content, and how a subreddit might select for participants that don’t fully represent the population at large- will lead to more interesting questions than, say, aiming to forecast something about morality in general. To put it another way, the data’s not unbiased- so maybe try to learn something about those biases.</p> <p>If you make something with this dataset, please share- perhaps we can form an international Asshole research collective, or at least keep each other apprised of findings. And of course, reach out if you encounter any difficulties or probable errors (you can file issues <a href="https://github.com/iterative/aita_dataset" target="_blank" rel="nofollow noopener noreferrer">on the GitHub repo</a>)!</p> <p>Lastly, please stay tuned for more releases- there are hundreds of new posts every day.
The biggest asshole may still be out there.</p> <hr> <h3 id="more-resources" style="position:relative;">More resources<a href="#more-resources" aria-label="more resources permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You may want to check out a few more efforts to get at r/AmItheAsshole from a data-scientific perspective, including <a href="https://medium.com/@tom.gonda/what-does-reddit-argue-about-28432b11ea26" target="_blank" rel="nofollow noopener noreferrer">topic modeling</a>, <a href="http://www.nathancunn.com/2019-04-04-am-i-the-asshole/" target="_blank" rel="nofollow noopener noreferrer">visualizing voting patterns</a> and <a href="https://twitter.com/felipehoffa/status/1223278090958209025" target="_blank" rel="nofollow noopener noreferrer">growth of the subreddit</a>, and <a href="https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/teaching/studienDiplomArbeiten/finished/2019/expose_fletcher.pdf" target="_blank" rel="nofollow noopener noreferrer">classification</a> with <a href="https://github.com/amr-amr/am-i-the-asshole" target="_blank" rel="nofollow noopener noreferrer">deep learning</a>. With a dataset this rich, there’s much more to be investigated, including continuing to refine these existing methods. 
And there’s almost certainly room to push the state of the art in asshole detection!</p> <p>If you're interested in learning more about using Reddit data, check out <a href="https://pushshift.io/" target="_blank" rel="nofollow noopener noreferrer">pushshift.io</a>, a database that contains basically all of Reddit's content (so why make this dataset? I wanted to remove some of the barriers to analyzing text from r/AmItheAsshole by providing an already-processed and cleaned version of the data that can be downloaded with a line of code; pushshift takes some work). You might use pushshift's API and/or praw to augment this dataset in some way- perhaps to compare activity in this subreddit with another, or broader patterns on Reddit.</p>https://dvc.org/blog/february-20-dvc-heartbeathttps://dvc.org/blog/february-20-dvc-heartbeatMon, 10 Feb 2020 00:00:00 GMT<p>Welcome to the February Heartbeat! This month's featured image is a DVC pipeline <a href="https://medium.com/nlp-trend-and-review-en/use-dvc-to-version-control-ml-dl-models-bef61dbfe477" target="_blank" rel="nofollow noopener noreferrer">created by one of our users</a>, which <em>we</em> think resembles a valentine. 
Here are some more highlights from our team and our community:</p> <h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><strong>Our team is growing!</strong> In early January, DVC gained two new folks: engineer <a href="https://github.com/skshetry" target="_blank" rel="nofollow noopener noreferrer">Saugat Pachhai</a> and data scientist <a href="https://twitter.com/andronovhopf" target="_blank" rel="nofollow noopener noreferrer">Elle O'Brien</a>. Saugat, based in Nepal, will be contributing to core DVC. Elle (that's me!), currently in San Francisco, will be leading data science projects and outreach with DVC.</p> <p>We're <strong>gearing up for a spring full of talks</strong> about DVC projects, including new up-and-coming features for data cataloging and continuous integration. 
Here are just a few events that have been added to our schedule:</p> <p> </p><section class="elp-content-holder"> <a href="https://www.mlprague.com/#schedule-saturday" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Machine Learning Prague - March 19</h4> <div class="elp-description">DVC engineer Pawel Redzynski will talk about open source tools for versioning machine learning projects.</div> <div class="elp-link">mlprague.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-02-10/mlprague-409b825d8df0cec780675a46f056799a.jpg" alt="Machine Learning Prague - March 19"> </div> </a> </section> <p></p> <p> </p><section class="elp-content-holder"> <a href="https://divops.org/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DivOps 2020 - March 24</h4> <div class="elp-description">Elle O'Brien is talking about open source software in the growing field of MLOps at this international, remote conference.</div> <div class="elp-link">https://divops.org/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-02-10/divops_logo-b53c4509a15b5cab656d1c2f21412dfe.png" alt="DivOps 2020 - March 24"> </div> </a> </section> <p></p> <p> </p><section class="elp-content-holder"> <a href="https://www.widsconference.org/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Women in Data Science San Diego - May 9</h4> <div class="elp-description">Elle O'Brien will be delivering a keynote talk about data catalogs and feature stores.</div> <div class="elp-link">https://www.widsconference.org/</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-02-10/wids-3d1684ad41ad2a2ba1b8f263e88163a7.jpeg" alt="Women in Data Science San Diego - May 9"> </div> 
</a> </section> <p></p> <p>-Elle O'Brien was recently accepted to give a keynote at <a href="https://www.widsconference.org/" target="_blank" rel="nofollow noopener noreferrer">Women in Data Science</a> San Diego on May 9. The talk is called "Packaging data and machine learning models for sharing."</p> <p>-Elle will also be speaking at <a href="https://divops.org/" target="_blank" rel="nofollow noopener noreferrer">Div Ops</a>, a new online conference about (you guessed it) DevOps, on March 27.</p> <p>Look out for more conference announcements soon- in our <strong>brand new community page!</strong> We've <a href="https://dvc.org/community" target="_blank" rel="nofollow noopener noreferrer">just launched a new hub</a> for sharing events, goings-on, and ways to contribute to DVC.</p> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Our users continue to put awesome things on the internet. 
Like this AI blogger who isn't afraid to wear his heart on his sleeve.</p> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/@matlihan/my-favorite-data-science-tool-is-dvc-data-version-control-e6ab8aed24d2" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">My favorite data science tool is DVC - Data Version Control</h4> <div class="elp-description">by Musa Atlıhan</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-02-10/musa_atlihan-a2ebdc84c073c368fb8ed093d3576db0.jpeg" alt="My favorite data science tool is DVC - Data Version Control"> </div> </a> </section> <p></p> <p>Musa Atlihan writes:</p> <blockquote> <p>From my experience, whether it is a real-world data science project or it is a data science competition, there are two major key components for success. Those components are API simplicity and reproducible pipelines. Since data science means experimenting a lot in a limited time frame, first, we need machine learning tools with simplicity and second, we need reliable/reproducible machine learning pipelines. Thanks to tools like Keras, LightGBM, and fastai we already have simple yet powerful tools for rapid model development. And thanks to DVC, we are building large projects with reproducible pipelines very easily.</p> </blockquote> <p>It's cool how Musa puts DVC in context with libraries for model building. In a way, the libraries that have made it easier than ever to iterate through different model architectures have increased the need for reproducibility in proportion.</p> <p>Meanwhile in Germany, superusers Marcel Mikl and Bert Besser wrote <a href="https://blog.codecentric.de/en/2019/03/walkthrough-dvc/" target="_blank" rel="nofollow noopener noreferrer">another</a> seriously comprehensive article about DVC for Codecentric. 
Marcel and Bert walk readers through the steps to <strong>build a custom machine learning training pipeline with remote computing resources</strong> like GCP and AWS. It's an excellent guide to configuring model training with attention to <em>automation</em> and <em>collaboration</em>. We give them 🦉🦉🦉🦉🦉 out of 5.</p> <p> </p><section class="elp-content-holder"> <a href="https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Remote training with GitLab-CI and DVC</h4> <div class="elp-description">by Marcel Mikl and Bert Besser</div> <div class="elp-link">blog.codecentric.de</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-02-10/marcel-176e6cfec67a3f909f1f2c8b58615383.png" alt="Remote training with GitLab-CI and DVC"> </div> </a> </section> <p></p> <p>Here are a few more stories on our radar:</p> <ul> <li> <p><strong>AI Singapore shares their method for AI development and deployment.</strong> This <a href="https://makerspace.aisingapore.org/2020/01/agile-ai-engineering-in-aisg/" target="_blank" rel="nofollow noopener noreferrer">blog about how Agile informs their processes</a> for continuous integration and delivery includes data versioning.</p> </li> <li> <p><strong>Toucan AI dispenses advice for ML engineers.</strong> This <a href="https://toucanai.com/blog/post/building-production-ml/" target="_blank" rel="nofollow noopener noreferrer">blog for practitioners</a> discusses questions like, "When to work on ML vs. the processes that surround ML". It covers how DVC is used for model versioning in the exploration stage of ML.</p> </li> <li> <p><strong>DVC at the University.</strong> A recent 
<a href="https://arxiv.org/pdf/1912.01706.pdf" target="_blank" rel="nofollow noopener noreferrer">pre-print from natural language processing researchers at Université Laval</a> explains how DVC facilitated dataset access for collaborators.</p> <blockquote> <p>"In our case, the original dataset takes up to 6 Gigabytes. The previous way of retrieving the dataset over the network with a standard 20 Mbits/sec internet connexion took up to an hour to complete (including uncompressing the data). Using DVC reduced the retrieval time of the dataset to 3 minutes over the network with the same internet connexion."</p> </blockquote> <p>Thanks for sharing- this is a lovely result. Oh, and last…</p> </li> <li> <p><strong>DVC is a job requirement</strong>! We celebrated a small milestone when we stumbled across a listing for a data engineer to support R&D at <a href="https://www.elvie.com/en-us/" target="_blank" rel="nofollow noopener noreferrer">Elvie</a>, a maker of tech for women's health (pretty neat mission). The decorations on the job posting are ours 😎</p> </li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 470px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/f0e8a9d4e7525ba2c56504833e14c3cd/39600/elvie.png" alt="elvie" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>A job advertisement featuring DVC.</em></p>https://dvc.org/blog/gsoc-ideas-2020https://dvc.org/blog/gsoc-ideas-2020Tue, 04 Feb 2020 00:00:00 GMT<p>Announcement, announcement! After a successful experience with <a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a> in 2019, we're putting out a call for students to apply to work with DVC as part of <a href="https://summerofcode.withgoogle.com/" target="_blank" rel="nofollow noopener noreferrer">Google Summer of Code</a>. 
If you want to make a dent in open source software development with mentorship from our team, read on.</p> <h2 id="prerequisites-to-apply" style="position:relative;">Prerequisites to apply<a href="#prerequisites-to-apply" aria-label="prerequisites to apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Besides the general requirements to apply to Google Summer of Code, there are a few skills we look for in applicants.</p> <ol> <li><strong>Python experience.</strong> All of our core development is done in Python, so we prefer candidates that are experienced in Python. However, we will consider applicants who are very strong in another language and familiar with Python basics.</li> <li><strong>Git experience.</strong> Git is also a key part of DVC development, as DVC is built around Git; that said, for certain projects (rated as “Beginner”) a surface-level knowledge of Git will be sufficient.</li> <li><strong>People skills.</strong> Beyond technical fundamentals, we put a high value on communication skills: the ability to report and document your experiments and findings, to work kindly with teammates, and explain your goals and work clearly.</li> </ol> <p>If you like our mission but aren't sure if you're sufficiently prepared, please be in touch anyway. 
We'd love to hear from you.</p> <h2 id="project-ideas" style="position:relative;">Project ideas<a href="#project-ideas" aria-label="project ideas permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Below are several project ideas that are an immediate priority for the core DVC team. Of course, we welcome students to create their own proposals, even if they differ from our ideas. Projects will be primarily mentored by co-founders <a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> and <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">Ivan Shcheklein</a>.</p> <ol> <li> <p><strong>Migrate to the latest v3 API to improve Google Drive support.</strong> Our organization is a co-maintainer of the PyDrive library in collaboration with a team at Google. The PyDrive library is now several years old and still relies on the v2 protocol. We would like to migrate to v3, which we expect will boost performance for many DVC use cases (e.g. the ability to filter fields being retrieved from our API, etc). For this project, we’re looking for a student to work with us to prepare the next major version of the PyDrive library, as well as making important changes to the core DVC code to support it. Because PyDrive is broadly used outside of DVC, this project is a chance to work on a library of widespread interest to the Python community. 
<br> <br> <em>Skills required:</em> Python, Git, experience with APIs <br> <em>Difficulty rating:</em> Beginner-Medium <br></p> </li> <li> <p><strong>Introducing parallelism to DVC.</strong> One of DVC’s features is the ability to create pipelines, linking data repositories with code to process data, train models, and evaluate model metrics. Once a DVC pipeline is created, the pipeline can be shared and re-run in a systematic and entirely reproducible way. Currently, DVC executes pipelines sequentially, even though some steps may be run in parallel (such as data preprocessing). We would like to support parallelization for pipeline steps specified by the user. Furthermore, we’ll need to support building flags into DVC commands that specify the level of parallelization (CPU, GPU or memory). <br> <br> <em>Skills required:</em> Python, Git. Some experience with parallelization and/or scientific computing would be helpful but not required. <br> <em>Difficulty rating:</em> Advanced <br></p> </li> <li> <p><strong>Developing use cases for data registries and ML model zoos.</strong> A new DVC functionality that we’re particularly excited about is <code>summon</code>, a method that can turn remotely-hosted machine learning artifacts such as datasets, trained models, and more into objects in the user’s local environment (such as a Jupyter notebook). This is a foundation for creating data catalogs of data-frames and machine learning model zoos on top of Git repositories and cloud storages (like GCS or S3). We need to identify and implement model zoos (think PyTorch Hub, the Caffe Model Zoo, or the TensorFlow DeepLab Model Zoo) and data registries for types that are not supported by DVC yet. Currently, we’ve tested <code>summon</code> with PyTorch image segmentation models and Pandas dataframes. We’re looking for students to explore other possible use cases. 
<br> <br> <em>Skills required:</em> Python, Git, and some machine learning or data science experience <br> <em>Difficulty rating:</em> Beginner-Medium <br></p> </li> <li> <p><strong>Continuous delivery for JetBrains TeamCity.</strong> Continuous integration and continuous delivery (CI/CD) for ML projects is an area where we see <a href="https://martinfowler.com/articles/cd4ml.html" target="_blank" rel="nofollow noopener noreferrer">DVC make a big impact</a>- specifically, by delivering datasets and ML models into CI/CD pipelines. While there are many cases when DVC is used inside GitHub Actions and GitLab CI, you will be transferring this experience to another type of CI/CD system, <a href="https://www.jetbrains.com/teamcity/" target="_blank" rel="nofollow noopener noreferrer">JetBrains TeamCity</a>. We're working to integrate DVC's model and dataset versioning into TeamCity's CI/CD toolkit. This project would be ideal for a student looking to explore the growing field of MLOps, an offshoot of DevOps with the specifics of ML projects at the center. <br> <br> <em>Skills required:</em> Python, Git, bash scripting. It would be nice, but not necessary, to have some experience with CI/CD tools and developer workflow automation. <br> <em>Difficulty rating:</em> Medium-Advanced <br></p> </li> <li> <p><strong>DVC performance testing framework.</strong> Performance is a core value of DVC. We will be creating a performance monitoring and testing framework where new scenarios (e.g., unit testing) can be populated. The framework should reflect all performance improvements and degradations for each of the DVC releases. It would be especially compelling if testing could be integrated with our GitHub workflow (CI/CD). This is a great opportunity for a student to learn about DVC and versioning in-depth and contribute to its stability. <br> <br> <em>Skills required:</em> Python, Git, bash scripting. 
<br> <em>Difficulty rating:</em> Medium-Advanced <br></p> </li> </ol> <h2 id="if-youd-like-to-apply" style="position:relative;">If you'd like to apply<a href="#if-youd-like-to-apply" aria-label="if youd like to apply permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Please refer to the <a href="https://summerofcode.withgoogle.com/" target="_blank" rel="nofollow noopener noreferrer">Google Summer of Code</a> application guides for specifics of the program. Students looking to know more about DVC, and our worldwide community of contributors, will learn most by visiting our <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord channel</a>, <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub repository</a>, and <a href="https://discuss.dvc.org/" target="_blank" rel="nofollow noopener noreferrer">Forum</a>. 
We are available to discuss project proposals from interested students and can be reached by <a href="mailto:[email protected]" target="_blank" rel="nofollow noopener noreferrer">email</a> or on our Discord channel.</p>https://dvc.org/blog/january-20-community-gemshttps://dvc.org/blog/january-20-community-gemsMon, 20 Jan 2020 00:00:00 GMT<h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There's a lot of action in our Discord channel these days. Ruslan, DVC's core maintainer, said it best with a gif.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">How it feels when <a href="https://twitter.com/DVCorg">@DVCorg</a> team is handling multiple conversations on Discord at the same time. <a href="https://t.co/QrLusdWYml">https://t.co/QrLusdWYml</a></p>— Ruslan Kuprieiev 🇺🇦 (@rkuprieiev) <a href="https://twitter.com/rkuprieiev/status/1144008869414342658">June 26, 2019</a></blockquote> <p>It's a lot to keep up with, so here are some highlights. 
We think these are useful, good-to-know, and interesting conversations between DVC developers and users.</p> <h3 id="q-what-pros-does-dvc-have-compared-to-git-lfs" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/657590900754612284" target="_blank" rel="nofollow noopener noreferrer">What pros does DVC have compared to Git LFS?</a><a href="#q-what-pros-does-dvc-have-compared-to-git-lfs" aria-label="q what pros does dvc have compared to git lfs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>For an in-depth answer, check out this <a href="https://stackoverflow.com/questions/58541260/difference-between-git-lfs-and-dvc" target="_blank" rel="nofollow noopener noreferrer">Stack Overflow discussion</a>. But in brief, with DVC you don't need a special server, and you can use nearly any kind of storage (S3, Google Cloud Storage, Azure Blobs, your own server, etc.) without a fuss. There are also no limits on the size of the data that you can store, unlike with GitHub. With Git LFS, there are some general LFS server limits, too. DVC has additional features for sharing your data (e.g., <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>) and has pipeline support, so it does much more than LFS. Plus, we have flexible and quick checkouts, as we utilize different link types (reflinks, symlinks, and hardlinks). 
We think there are lots of advantages; of course, the usefulness will depend on your particular needs.</p> <h3 id="q-how-do-i-use-dvc-with-ssh-remote-storage-i-usually-connect-with-a-pem-key-file-how-do-i-do-the-same-with-dvc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/656016145119182849" target="_blank" rel="nofollow noopener noreferrer">How do I use DVC with SSH remote storage?</a> I usually connect with a .pem key file. How do I do the same with DVC?<a href="#q-how-do-i-use-dvc-with-ssh-remote-storage-i-usually-connect-with-a-pem-key-file-how-do-i-do-the-same-with-dvc" aria-label="q how do i use dvc with ssh remote storage i usually connect with a pem key file how do i do the same with dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC is built to work with the SSH protocol to access remote storage (we provide some <a href="https://dvc.org/doc/user-guide/external-dependencies#ssh" target="_blank" rel="nofollow noopener noreferrer">examples in our official documentation</a>). 
When SSH requires a key file, try this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote keyfile <span class="token operator">&lt;</span>path to *.pem<span class="token operator">&gt;</span></span></code></pre></div> <h3 id="q-if-you-train-a-tensorflow-model-that-creates-multiple-checkpoint-files-how-do-you-establish-them-as-dependencies-in-the-dvc-pipeline" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/651098762466426891" target="_blank" rel="nofollow noopener noreferrer">If you train a TensorFlow model that creates multiple checkpoint files, how do you establish them as dependencies in the DVC pipeline?</a><a href="#q-if-you-train-a-tensorflow-model-that-creates-multiple-checkpoint-files-how-do-you-establish-them-as-dependencies-in-the-dvc-pipeline" aria-label="q if you train a tensorflow model that creates multiple checkpoint files how do you establish them as dependencies in the dvc pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can specify a directory as a dependency/output in your DVC pipeline, and store checkpointed models in that directory. 
It might look like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token punctuation">\</span> <span class="token parameter variable">-f</span> train.dvc <span class="token punctuation">\</span> <span class="token parameter variable">-d</span> data <span class="token punctuation">\</span> <span class="token parameter variable">-d</span> code/train.py <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> models python code/train.py</span></code></pre></div> <p>where <code>models</code> is a directory created for checkpoint files. If you would like to preserve your models in the data directory, though, then you would need to specify them one by one. You can do this with bash:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token variable"><span class="token variable">$(</span><span class="token keyword">for</span> <span class="token for-or-select variable">file</span> <span class="token keyword">in</span> data/*.gz<span class="token punctuation">;</span> <span class="token keyword">do</span> <span class="token builtin class-name">echo</span> <span class="token parameter variable">-d</span> $file<span class="token punctuation">;</span> <span class="token keyword">done</span><span class="token variable">)</span></span></span></code></pre></div> <p>Be careful, though: if you declare checkpoint files to be an output of the DVC pipeline, you won’t be able to re-run the pipeline using those checkpoint files to initialize weights for model training. 
This would introduce circularity, as your output would become your input.</p> <p>Also keep in mind that whenever you re-run a pipeline with <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, outputs are deleted and then regenerated. If you don't wish to automatically delete outputs, there is a <code>--persist</code> flag (see discussion <a href="https://github.com/iterative/dvc/issues/1214" target="_blank" rel="nofollow noopener noreferrer">here</a> and <a href="https://github.com/iterative/dvc/issues/1884" target="_blank" rel="nofollow noopener noreferrer">here</a>), although we don't currently provide technical support for it.</p> <p>Finally, remember that setting something as a dependency (<code>-d</code>) doesn't mean it is automatically tracked by DVC. So remember to <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> data files in the beginning!</p> <h3 id="q-is-it-possible-to-use-the-same-cache-directory-for-multiple-dvc-repos-that-are-used-in-parallel-or-do-i-need-external-software-to-prevent-potential-race-conditions" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/655012135973158942" target="_blank" rel="nofollow noopener noreferrer">Is it possible to use the same cache directory for multiple DVC repos that are used in parallel?</a> Or do I need external software to prevent potential race conditions?<a href="#q-is-it-possible-to-use-the-same-cache-directory-for-multiple-dvc-repos-that-are-used-in-parallel-or-do-i-need-external-software-to-prevent-potential-race-conditions" aria-label="q is it possible to use the same cache directory for multiple dvc repos that are used in parallel or do i need external software to prevent potential race conditions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is absolutely possible, and you don't need any external software to safely use multiple DVC repos in parallel. With DVC, cache operations are atomic. The only exception is cleaning the cache with <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a>, which you should only run when no one else is working on a shared project that is referenced in your cache (and also, be sure to use the <code>--projects</code> flag <a href="https://dvc.org/doc/command-reference/gc" target="_blank" rel="nofollow noopener noreferrer">as described in our docs</a>). For more about using multiple DVC repos in parallel, check out some discussions <a href="https://discuss.dvc.org/t/setup-dvc-to-work-with-shared-data-on-nas-server/180" target="_blank" rel="nofollow noopener noreferrer">here</a> and <a href="https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="q-what-are-some-strategies-for-reproducibility-if-parts-of-our-model-training-pipeline-are-run-on-our-organizationss-hpc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/652380507832844328" target="_blank" rel="nofollow noopener noreferrer">What are some strategies for reproducibility if parts of our model training pipeline are run on our organization's HPC?</a><a href="#q-what-are-some-strategies-for-reproducibility-if-parts-of-our-model-training-pipeline-are-run-on-our-organizationss-hpc" aria-label="q what are some strategies for reproducibility if parts of our model training pipeline are run on our organizationss hpc permalink" class="anchor after"><svg 
aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Using DVC for version control is entirely compatible with using remote computing resources, like high-performance computing (HPC), in your model training pipeline. We think a great example of using DVC with parallel computing is provided by <a href="http://www.peterfogh.dk/" target="_blank" rel="nofollow noopener noreferrer">Peter Fogh</a>. Take a <a href="https://github.com/PeterFogh/dvc_dask_use_case" target="_blank" rel="nofollow noopener noreferrer">look at his repo</a> for a detailed use case. Please keep us posted about how HPC works in your pipeline, as we'll be eager to pass on any insights to the community.</p> <h3 id="q-say-i-have-a-git-repository-with-multiple-projets-inside-one-classification-one-object-detection-etc-is-it-possible-to-tell-dvc-to-just-pull-data-for-one-particular-project" style="position:relative;">Q: Say I have a Git repository with multiple projects inside (one classification, one object detection, etc.). 
<a href="https://discordapp.com/channels/485586884165107732/563406153334128681/646760832616890408" target="_blank" rel="nofollow noopener noreferrer">Is it possible to tell DVC to just pull data for one particular project?</a><a href="#q-say-i-have-a-git-repository-with-multiple-projets-inside-one-classification-one-object-detection-etc-is-it-possible-to-tell-dvc-to-just-pull-data-for-one-particular-project" aria-label="q say i have a git repository with multiple projets inside one classification one object detection etc is it possible to tell dvc to just pull data for one particular project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Absolutely, DVC supports pulling data from different DVC files. An example would be having two project subdirectories in your Git repo, <code>classification</code> and <code>detection</code>. You could use <a href="https://dvc.org/doc/command-reference/pull#-R"><code>dvc pull -R classification</code></a> to only pull files in that project to your workspace.</p> <p>If you prefer to be even more granular, you can <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> files individually. 
Then you can use <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull &lt;filename&gt;.dvc</code></a> to retrieve the outputs specified only by that file.</p> <h3 id="q-is-it-possible-to-set-an-s3-remote-without-the-use-of-aws-credentials-with-dvc-i-want-to-publicly-host-a-dataset-so-that-everybody-who-clones-my-code-repo-can-just-run-dvc-pull-to-fetch-the-dataset" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/623234659098296348" target="_blank" rel="nofollow noopener noreferrer">Is it possible to set an S3 remote without the use of AWS credentials with DVC?</a> I want to publicly host a dataset so that everybody who clones my code repo can just run <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> to fetch the dataset.<a href="#q-is-it-possible-to-set-an-s3-remote-without-the-use-of-aws-credentials-with-dvc-i-want-to-publicly-host-a-dataset-so-that-everybody-who-clones-my-code-repo-can-just-run-dvc-pull-to-fetch-the-dataset" aria-label="q is it possible to set an s3 remote without the use of aws credentials with dvc i want to publicly host a dataset so that everybody who clones my code repo can just run dvc pull to fetch the dataset permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, and we love the idea of publicly hosting a dataset. There are a few ways to do it with DVC. We use one method in our own DVC project repository on GitHub. 
If you run <code>git clone https://github.com/iterative/dvc</code> and then <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>, you’ll see that DVC is downloading data from an HTTP repository, which is actually just an S3 repository that we've granted public HTTP read-access to.</p> <p>So you would need to configure two remotes in your config file, each pointing to the same S3 bucket through different protocols. Like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> <span class="token parameter variable">--local</span> myremote s3://bucket/path </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> mypublicremote http://s3-external-1.amazonaws.com/bucket/path</span></code></pre></div> <p>Here's why this works: the <code>-d</code> flag sets the default remote, and the <code>--local</code> flag creates a set of configuration preferences that will override the global settings when DVC commands are run locally and won't be shared through Git (you can read more about this <a href="https://dvc.org/doc/command-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer">in our docs</a>).</p> <p>This means that even though you and users from the public are accessing the stored dataset by different protocols (S3 and HTTP), you'll all run the same command: <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>.</p>https://dvc.org/blog/january-20-dvc-heartbeathttps://dvc.org/blog/january-20-dvc-heartbeatFri, 17 Jan 2020 00:00:00 GMT<p>Welcome to the New Year! 
Time for a recap of the last few weeks of activity in the DVC community.</p> <h2 id="news" style="position:relative;">News<a href="#news" aria-label="news permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We were honored to be named a <a href="https://ods.ai/awards/2019/" target="_blank" rel="nofollow noopener noreferrer">Project of the Year</a> by Open Data Science, Russia's largest community of data scientists and machine learning practitioners. Check out our ⭐️incredibly shiny trophy⭐️!</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">DVC is the "project of the year" according to @odsai_en!<br>😱🏆🎉<br>Open Data Science the largest DS community we know, with over 40K active members, great courses and it's own conf Data Fest.<br>Many thanks to the organizers and voters!<br>This is the best surprize gift for the team!!🥳 <a href="https://t.co/LZgewjM582">pic.twitter.com/LZgewjM582</a></p>— 🦉DVC (@DVCorg) <a href="https://twitter.com/DVCorg/status/1209544709930016768">December 24, 2019</a></blockquote> <p>DVC hit <strong>100 individual contributors</strong> on Github! To celebrate our 100<sup>th</sup> contributor, <a href="https://github.com/verasativa/" target="_blank" rel="nofollow noopener noreferrer">Vera Sativa</a>, we sent her $500 to use on any educational opportunity and her own DeeVee (that's our rainbow owl). 
We also awarded educational mini-grants to two of DVC's biggest contributors, <a href="https://github.com/witiko" target="_blank" rel="nofollow noopener noreferrer">Vít Novotný</a>, and <a href="https://twitter.com/david_prihoda" target="_blank" rel="nofollow noopener noreferrer">David Příhoda</a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 612px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/78b685e283d679c8ebe518ea17520f6d/39600/odd_with_deevee.png" alt="odd with deevee" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Vera (center, flashing a peace sign) thanked us with this lovely picture of DeeVee and her team, <a href="https://odd.co" target="_blank" rel="nofollow noopener noreferrer">Odd Industries</a>. They are making some extremely neat tools for construction teams using computer vision.</em></p> <p><strong>We were at PyData LA!</strong> Our fearless leader <a href="https://www.youtube.com/watch?v=7Wsd6V0k4Oc" target="_blank" rel="nofollow noopener noreferrer">Dmitry gave a talk</a> and we set up a busy booth to meet with the Pythonistas of Los Angeles. It was a cold and blustery day, but visitors kept showing up to our semi-outdoor booth. 
We're sure they came for the open source version control and not the donuts.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 512px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/c827a7148f442ec7b39f79659a697878/03346/py_data1.jpg" alt="py data1" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 512px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/76308821da8925b6cf7540b9b0b1ea3f/03346/py_data2.jpg" alt="py data2" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>The DVC team and PyData volunteers who heroically staffed our booth in the rain.</em></p> <p>Our engineer and technical writer Jorge reported:</p> <blockquote> <p>We were super happy to meet all kinds of data professionals and enthusiasts in several fields who are learning and adopting DVC with their teams – including several working with privacy-sensitive medical records, very cool!</p> </blockquote> <hr> <h2 id="from-the-community" style="position:relative;">From the community<a href="#from-the-community" aria-label="from the community permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Here are some rumblings from the machine learning (ML) and data science community that got us talking.</p> <p><strong>A machine learning software wishlist.</strong> 
Computer scientist and writer <a href="https://twitter.com/chipro" target="_blank" rel="nofollow noopener noreferrer">Chip Huyen</a> tweeted about her ML software wishlist and kicked off a big community discussion.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I've been thinking about the software stack for machine learning. Tools I'd love to see.<br><br>1. Pip for pretrained models.<br>2. Version control for datasets.<br>3. GPU-friendly CI. Travis CI, Circe CI don't support GPUs. Jenkins is a pain.<br>4. Fast dataframes. Why is Pandas so slow?</p>— Chip Huyen (@chipro) <a href="https://twitter.com/chipro/status/1202815757593108480">December 6, 2019</a></blockquote> <p>Her tweet resonated with a lot of practitioners, who were eager to discuss the solutions they'd tried. Among the many thoughtful replies and recommendations, we were thrilled to see DVC mentioned.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">We're using <a href="https://twitter.com/DVCorg">@DVCorg</a> for 2) and it works great. 🙂</p>— Kristijan (@kristijan_moves) <a href="https://twitter.com/kristijan_moves/status/1202879739716870144">December 6, 2019</a></blockquote> <p>If you haven't already, definitely check out Chip's <a href="https://twitter.com/chipro/status/1202815757593108480" target="_blank" rel="nofollow noopener noreferrer">thread</a>, and follow her on Twitter for more excellent, accessible content about ML engineering. We're thinking hard about these ideas and hope the discussion continues on- and offline.</p> <p><strong>A gentle intro to DVC for data scientists.</strong> Scientist <a href="https://twitter.com/andronovhopf" target="_blank" rel="nofollow noopener noreferrer">Elle O'Brien</a> published a code walkthrough about using DVC to make an image classification project more reproducible. Specifically, the blog is a case study about version control when a dataset grows over time. 
If you're looking for a DVC tutorial geared toward data scientists, this might be up your alley.</p> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/start-version-controlling-your-machine-learning-datasets-2b872e109856" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Start Version Controlling your Machine Learning Datasets</h4> <div class="elp-description">Make your machine learning and data science projects reproducible with open source tools.</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-01-17/medium_1-65307a03bcb030a905958954107696f0.png" alt="Start Version Controlling your Machine Learning Datasets"> </div> </a> </section> <p></p> <p><strong>Ideas for data scientists to level up their code.</strong> Machine learning engineer Andrew Greatorex posted a blog called “Down with technical debt! Clean Python for data scientists.” Andrew highlights something we can easily relate to: the “science” part of data science, which encourages experimentation and flexibility, sometimes means less emphasis on readable, shareable code. Andrew writes:</p> <blockquote> <p>“I’m hoping to shed light on some of the ways that more fledgling data scientists can write cleaner Python code and better structure small scale projects, with the important side effect of reducing the amount of technical debt you inadvertently burden on yourself and your team.”</p> </blockquote> <p>In this blog, DVC gets a shout-out as Andrew’s preferred data versioning tool, used in conjunction with Git for versioning Python code. 
Thanks!</p> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/down-with-technical-debt-clean-python-for-data-scientists-aa7592eff7fc" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Down with technical debt! Clean Python for data scientists.</h4> <div class="elp-description"></div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-01-17/medium_2-3ad04f500f8cbbe7108a635482d68baf.png" alt="Down with technical debt! Clean Python for data scientists."> </div> </a> </section> <p></p> <p><strong>An introduction to MLOps.</strong> Engineer <a href="https://twitter.com/elfouly_sharif" target="_blank" rel="nofollow noopener noreferrer">Sharif Elfouly</a> wrote an approachable guide to thinking about MLOps, the growing field around making ML projects run efficiently from experimentation to production. He summarises why managing ML projects can be fundamentally different from traditional software development:</p> <blockquote> <p>“The main difference between traditional software and ML is that you don’t only have the code. You also have data, models, and experiments. Writing traditional software is relatively straightforward but in ML you need to try out a lot of different things to find the best and fastest model for your use-case. You have a lot of different model types to choose from and every single one of them has its specific hyperparameters. Even if you work alone this can get out of hand pretty quickly.”</p> </blockquote> <p>Sharif gives some recommendations for tools that work especially well for ML, and he writes that DVC is the “perfect combination for versioning your code and data.” Thanks, Sharif! 
We think you’re perfect, too.</p> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/down-with-technical-debt-clean-python-for-data-scientists-aa7592eff7fc" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">MLOps Done Right</h4> <div class="elp-description">What is MLOps? Why is it so important? How to do it right!</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2020-01-17/medium_3-225e59dae2ce1f7bef910517f0bd8ae6.png" alt="MLOps Done Right"> </div> </a> </section> <p></p> <p>That's a wrap for January. We'll see you next month with more updates!</p>https://dvc.org/blog/november-19-dvc-heartbeathttps://dvc.org/blog/november-19-dvc-heartbeatSat, 14 Dec 2019 00:00:00 GMT<p>The past few months have been so busy and full of great events! We love how involved our community is and can’t wait to share more with you:</p> <ul> <li> <p>We have organized our very first <a href="https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/" target="_blank" rel="nofollow noopener noreferrer">meetup</a>! So many great conversations, new use cases and insights! Many thanks to <a href="https://www.linkedin.com/in/daniel-fischetti-4a6592bb/" target="_blank" rel="nofollow noopener noreferrer">Dan Fischetti</a> from <a href="https://standard.ai/" target="_blank" rel="nofollow noopener noreferrer">Standard Cognition</a>, who joined our Dmitry Petrov on stage. 
Watch the recording here.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/RHQXK7EC0jI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> </li> <li> <p><a href="https://blog.dataversioncontrol.com/dvc-org-for-hacktoberfest-2019-ce5320151a0c" target="_blank" rel="nofollow noopener noreferrer">Hacktoberfest</a> was a great exercise for the DVC team on many levels, and we really enjoyed supporting new contributors. Kudos to <a href="https://twitter.com/explorer_07" target="_blank" rel="nofollow noopener noreferrer">Nabanita Dash</a> for organizing a cool DVC-themed hackathon!</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Our open source event Hacktoberfest-themed meet-up was a success. Thanks to <a href="https://twitter.com/DVCorg">@DVCorg</a> and it's mentors for all the hard work. <br>Some of our attendees made their first PR on DVC and got them merged. Kudos to the team! <br>PS: 🍕 was the second best thing of the evening. 
<a href="https://t.co/zAWC0TVlPd">pic.twitter.com/zAWC0TVlPd</a></p>— Programming Society IIIT-Bh (@psociiit) <a href="https://twitter.com/psociiit/status/1185150096792535040">October 18, 2019</a></blockquote> </li> <li> <p>We’ve crossed the 4k stars mark on <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">GitHub</a>!</p> </li> <li> <p>DVC participated in the <a href="https://twitter.com/FossMec/status/1192866498324254720" target="_blank" rel="nofollow noopener noreferrer">Devsprints</a> (thank you, <a href="https://twitter.com/kurianbenoy2" target="_blank" rel="nofollow noopener noreferrer">Kurian Benoy</a>, for the intro!) and we were happy to jump in and help with some mentoring.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Thank you <a href="https://twitter.com/DVCorg">@DVCorg</a> for participating in the Devsprints, by <a href="https://twitter.com/FossMec">@FossMEC</a> and <a href="https://twitter.com/excelmec">@excelmec</a>. We had <a href="https://twitter.com/shcheklein">@shcheklein</a> who joined us all the way from SF and explained how open source is boosting the future. 
Srinidhi and <a href="https://twitter.com/kurianbenoy2">@kurianbenoy2</a> helped participants get started to contributing to the project.</p>— FOSS MEC (@FossMec) <a href="https://twitter.com/FossMec/status/1192866498324254720">November 8, 2019</a></blockquote> </li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/1fe957ddccf9aa3e7bb643d8e8ea8bed/39600/devsprints.png" alt="devsprints" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Devsprints participants on our <a href="http://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">Discord</a> channel</em></p> <ul> <li> <p>DVC became part of the default <a href="https://formulae.brew.sh/formula/dvc" target="_blank" rel="nofollow noopener noreferrer">Homebrew formulae</a>! So now you can install it as easily as <code>brew install dvc</code>!</p> </li> <li> <p>We helped two aspiring speakers deliver their very first conference talks. <a href="https://twitter.com/kurianbenoy2/status/1183427495342694401?s=20" target="_blank" rel="nofollow noopener noreferrer">Kurian Benoy</a> spoke at <a href="https://in.pycon.org/2019/" target="_blank" rel="nofollow noopener noreferrer">PyconIndia</a> and <a href="https://www.linkedin.com/in/aman-sharma606/" target="_blank" rel="nofollow noopener noreferrer">Aman Sharma</a> spoke at <a href="https://scipy.in/2019#speakers" target="_blank" rel="nofollow noopener noreferrer">SciPyIndia</a>. 
<strong>Supporting speakers is something we are passionate about, and if you ever want to give a talk on a DVC-related topic, we are here to help. Just <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">let us know</a>!</strong></p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/Ipzf6oQqQpo?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> </li> <li> <p>Our own <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> went to Europe to speak at the <a href="https://osseu19.sched.com/speaker/dmitry35" target="_blank" rel="nofollow noopener noreferrer">Open Source Summit Europe</a> in Lyon and <a href="https://www.highload.ru/moscow/2019/abstracts/6032" target="_blank" rel="nofollow noopener noreferrer">Highload++</a> in Moscow, and made a stop in Berlin to co-host a <a href="https://www.meetup.com/codecentric-Berlin/events/265555810/" target="_blank" rel="nofollow noopener noreferrer">meetup</a> with our favourite AI folks from <a href="https://www.codecentric.de/" target="_blank" rel="nofollow noopener noreferrer">Codecentric</a>!</p> </li> </ul> <hr> <p>Here are some of the great pieces of content around DVC and MLOps that we discovered in October and November:</p> <ul> <li><strong><a href="https://www.deploymachinelearning.com/" target="_blank" rel="nofollow noopener noreferrer">Deploy Machine Learning Models with Django</a> by Piotr 
Płoński.</strong></li> </ul> <blockquote> <p>…building your ML system has a great advantage — it is tailored to your needs. It has all features that are needed in your ML system and can be as complex as you wish. This tutorial is for readers who are familiar with ML and would like to learn how to build ML web services.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://www.deploymachinelearning.com/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Deploy Machine Learning Models with Django</h4> <div class="elp-description">Version 1.0 (04/11/2019) Piotr Płoński The demand for Machine Learning (ML) applications is growing. Many resources…</div> <div class="elp-link">deploymachinelearning.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-12-14/deploy-machine-learning-models-7cb0da3c9268f3e33159cdd160d56e13.png" alt="Deploy Machine Learning Models with Django"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://towardsdatascience.com/how-to-manage-your-machine-learning-workflow-with-dvc-weights-biases-and-docker-5529ea4e59e0" target="_blank" rel="nofollow noopener noreferrer">How to Manage Your Machine Learning Workflow with DVC, Weights & Biases, and Docker</a> by <a href="https://le-james94.medium.com" target="_blank" rel="nofollow noopener noreferrer">James Le</a>.</strong></li> </ul> <blockquote> <p>In this article, I want to show 3 powerful tools to simplify and scale up machine learning development within an organization by making it easy to track, reproduce, manage, and deploy models.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/how-to-manage-your-machine-learning-workflow-with-dvc-weights-biases-and-docker-5529ea4e59e0" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 
class="elp-title">How to Manage Your Machine Learning Workflow with DVC, Weights & Biases, and Docker</h4> <div class="elp-description">Managing a machine learning workflow is hard!</div> <div class="elp-link">towardsdatascience.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-12-14/how-to-manage-your-machine-learning-workflow-c8cf3d6c0b055c1bd0d9bfa4f8e6d4da.jpeg" alt="How to Manage Your Machine Learning Workflow with DVC, Weights & Biases, and Docker"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://towardsdatascience.com/creating-a-solid-data-science-development-environment-60df14ce3a34" target="_blank" rel="nofollow noopener noreferrer">Creating a solid Data Science development environment</a> by <a href="https://towardsdatascience.com/@gabrielsgoncalves" target="_blank" rel="nofollow noopener noreferrer">Gabriel dos Santos Goncalves</a>.</strong></li> </ul> <blockquote> <p>We do believe that Data Science is a field that can become even more mature by using best practices in project development and that Conda, Git, DVC, and JupyterLab are key components of this new approach</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/creating-a-solid-data-science-development-environment-60df14ce3a34" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Creating a solid Data Science development environment</h4> <div class="elp-description">How to organize and replicate your development environment using Conda, Git, DVC, and JupyterLab.</div> <div class="elp-link">towardsdatascience.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-12-14/creating-solid-data-science-dev-env-48e9ffb886f0ec17a2cfc0deab709dc6.png" alt="Creating a solid Data Science development environment"> </div> </a> </section> <p></p> <ul> <li><strong><a 
href="https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b" target="_blank" rel="nofollow noopener noreferrer">Creating reproducible data science workflows with DVC</a> by <a href="https://medium.com/@glib.ivashkevych" target="_blank" rel="nofollow noopener noreferrer">Gleb Ivashkevich</a>.</strong></li> </ul> <blockquote> <p>DVC is a powerful tool and we covered only the fundamentals of it.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/y-data-stories/creating-reproducible-data-science-workflows-with-dvc-3bf058e9797b" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Creating reproducible data science workflows with DVC</h4> <div class="elp-description">“Getting started” tutorial into DVC to make a structure and order in your daily ML routine</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-12-14/creating-reproducible-data-science-workflows-60ff778cfaeb3cacd8fe82988fe696eb.jpeg" alt="Creating reproducible data science workflows with DVC"> </div> </a> </section> <p></p> <hr> <h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are lots of hidden gems in our Discord community discussions. 
Sometimes they are scattered all over the channels and hard to track down.</p> <p>We sift through the issues and discussions and share the most interesting takeaways with you.</p> <h3 id="q-when-you-do-a-dvc-import-you-get-the-state-of-the-data-in-the-original-repo-at-that-moment-in-time-from-that-repo-right-the-overall-state-of-that-repo-eg-git-commit-id-hash-is-not-preserved-upon-import-right" style="position:relative;">Q: When you do a <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> you get the state of the data in the original repo at that moment in time from that repo, right? <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/618744949277458462" target="_blank" rel="nofollow noopener noreferrer">The overall state of that repo (e.g. Git <code>commit id</code> (hash)) is not preserved upon import, right?</a><a href="#q-when-you-do-a-dvc-import-you-get-the-state-of-the-data-in-the-original-repo-at-that-moment-in-time-from-that-repo-right-the-overall-state-of-that-repo-eg-git-commit-id-hash-is-not-preserved-upon-import-right" aria-label="q when you do a dvc import you get the state of the data in the original repo at that moment in time from that repo right the overall state of that repo eg git commit id hash is not preserved upon import right permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>On the contrary, DVC relies on the Git <code>commit id</code> (hash) to determine the state of the data as well as the code. 
The Git <code>commit id</code> (hash) is saved in the DVC-file upon import. The data itself is copied/downloaded into the DVC repo cache but is not pushed to the remote, because DVC does not create duplicates. When you need to advance the import to a newer version, use <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a>. The Git commit hash is saved to provide reproducibility: even if the source repo <code>HEAD</code> has changed, your import stays the same until you run <a href="https://dvc.org/doc/command-reference/update"><code>dvc update</code></a> or redo <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>.</p> <h3 id="q-im-trying-to-understand-if-dvc-is-an-appropriate-solution-for-storing-data-under-gdpr-requirements-that-means-that-permanent-deletion-of-files-with-sensitive-data-needs-to-be-fully-supported" style="position:relative;">Q: I’m trying to understand if DVC is an appropriate solution for storing data under GDPR requirements. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621057268145848340" target="_blank" rel="nofollow noopener noreferrer">That means that permanent deletion of files with sensitive data needs to be fully supported.</a><a href="#q-im-trying-to-understand-if-dvc-is-an-appropriate-solution-for-storing-data-under-gdpr-requirements-that-means-that-permanent-deletion-of-files-with-sensitive-data-needs-to-be-fully-supported" aria-label="q im trying to understand if dvc is an appropriate solution for storing data under gdpr requirements that means that permanent deletion of files with sensitive data needs to be fully supported permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 
1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, in this sense DVC is not very different from using bare S3, SSH or any other storage where you can go and just delete data. DVC adds a bit of overhead when locating a specific file to delete, but otherwise it’s all the same: you will be able to delete any file you want. Read more details in <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621062105524862987" target="_blank" rel="nofollow noopener noreferrer">this discussion</a>.</p> <h3 id="q-is-there-anyway-to-get-the-remote-url-for-specific-dvc-files-say-i-have-a-dvc-file-foopngdvc--is-there-a-command-that-will-show-the-remote-url-something-like-dvc-get-remote-url-foopngdvc-which-will-return-eg-the-azure-url-to-download" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621591769766821888" target="_blank" rel="nofollow noopener noreferrer">Is there anyway to get the remote url for specific DVC-files?</a> Say, I have a DVC-file <code>foo.png.dvc</code> — is there a command that will show the remote url, something like <code>dvc get-remote-url foo.png.dvc</code> which will return e.g. 
the Azure url to download.<a href="#q-is-there-anyway-to-get-the-remote-url-for-specific-dvc-files-say-i-have-a-dvc-file-foopngdvc--is-there-a-command-that-will-show-the-remote-url-something-like-dvc-get-remote-url-foopngdvc-which-will-return-eg-the-azure-url-to-download" aria-label="q is there anyway to get the remote url for specific dvc files say i have a dvc file foopngdvc is there a command that will show the remote url something like dvc get remote url foopngdvc which will return eg the azure url to download permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There is no special command for that, but if you are using Python, you could use our API specifically designed for that:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">from</span> dvc<span class="token punctuation">.</span>api <span class="token keyword">import</span> get_url

<span class="token comment"># path: location of the tracked file inside the repo (for example "foo.png")</span>
url <span class="token operator">=</span> get_url<span class="token punctuation">(</span>path<span class="token punctuation">,</span> repo<span class="token operator">=</span><span class="token string">"https://github.com/user/proj"</span><span class="token punctuation">,</span> rev<span class="token operator">=</span><span class="token string">"mybranch"</span><span class="token punctuation">)</span></code></pre></div> <p>You could also wrap this call in a small script and use it from the CLI as a wrapper command.</p> <h3 
id="q-can-dvc-be-integrated-with-ms-active-directory-ad-authentication-for-controlling-access-the-gdpr-requirements-would-force-me-to-use-such-a-system-to-manage-access" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/619244714071425035" target="_blank" rel="nofollow noopener noreferrer">Can DVC be integrated with MS Active Directory (AD) authentication for controlling access?</a> The GDPR requirements would force me to use such a system to manage access.<a href="#q-can-dvc-be-integrated-with-ms-active-directory-ad-authentication-for-controlling-access-the-gdpr-requirements-would-force-me-to-use-such-a-system-to-manage-access" aria-label="q can dvc be integrated with ms active directory ad authentication for controlling access the gdpr requirements would force me to use such a system to manage access permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Short answer: no (as of the date of publishing this Heartbeat issue). The good news is that it should be very easy to add, so we would welcome a contribution :) Azure has a connection argument for AD; a quick search turns up this <a href="https://github.com/AzureAD/azure-activedirectory-library-for-python" target="_blank" rel="nofollow noopener noreferrer">library</a>, which is probably what is needed.</p> <h3 id="q-how-do-i-uninstall-dvc-from-mac-installed-as-a-package" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/625124341201502209" target="_blank" rel="nofollow noopener noreferrer">How do I 
uninstall DVC from Mac installed as a package?</a><a href="#q-how-do-i-uninstall-dvc-from-mac-installed-as-a-package" aria-label="q how do i uninstall dvc from mac installed as a package permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>An installation from a plain <code>.pkg</code> is a bit tricky to uninstall, so we usually recommend using something like brew cask instead if you really need the binary package. To uninstall the package, try running these commands:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">rm</span> <span class="token parameter variable">-rf</span> /usr/local/bin/dvc </span><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">rm</span> <span class="token parameter variable">-rf</span> /usr/local/lib/dvc </span><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> pkgutil <span class="token parameter variable">--forget</span> com.iterative.dvc</span></code></pre></div> <p>After that, the package is uninstalled.</p> <h3 id="q-we-are-using-ssh-remote-to-store-data-but-the-problem-is-that-everyone-within-the-project-has-different-username-on-the-remote-machine-and-thus-we-cannot-set-it-in-the-config-file-that-is-committed-to-git-is-there-a-way-to-add-just-host-and-path-without-the-username" style="position:relative;">Q: We are using SSH remote to store data, but the 
problem is that everyone within the project has different username on the remote machine and thus we cannot set it in the config file (that is committed to Git). <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/619420070111608848" target="_blank" rel="nofollow noopener noreferrer">Is there a way to add just host and path, without the username?</a><a href="#q-we-are-using-ssh-remote-to-store-data-but-the-problem-is-that-everyone-within-the-project-has-different-username-on-the-remote-machine-and-thus-we-cannot-set-it-in-the-config-file-that-is-committed-to-git-is-there-a-way-to-add-just-host-and-path-without-the-username" aria-label="q we are using ssh remote to store data but the problem is that everyone within the project has different username on the remote machine and thus we cannot set it in the config file that is committed to git is there a way to add just host and path without the username permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, use the <code>--local</code> or <code>--global</code> config options to set the user per project or per machine, without committing it to Git:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote --local user myuser</span></code></pre></div> <p>or</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span 
class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote --global user myuser</span></code></pre></div> <h3 id="q-i-still-get-the-ssl-error-when-i-try-to-perform-a-dvc-push-with-or-without-use_ssl--false" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/628227197592797191" target="_blank" rel="nofollow noopener noreferrer">I still get the <code>SSL ERROR</code> when I try to perform a dvc push with or without <code>use_ssl = false</code></a>?<a href="#q-i-still-get-the-ssl-error-when-i-try-to-perform-a-dvc-push-with-or-without-use_ssl--false" aria-label="q i still get the ssl error when i try to perform a dvc push with or without use_ssl false permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Setting a simple environment variable before pushing, like this:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span class="token assign-left variable">AWS_CA_BUNDLE</span><span class="token operator">=</span>/path/to/cert/cert.crt </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc push</span></span></code></pre></div> <p>should do the trick for now. We plan to fix the <code>ca_bundle</code> option soon.</p> <h3 id="q-i-have-just-finished-a-lengthy-dvc-repro-and-im-happy-with-the-result-however-i-realized-that-i-didnt-specify-a-dependency-which-i-needed-and-obviously-is-used-in-the-computation-can-i-somehow-fix-it" style="position:relative;">Q: I have just finished a lengthy <a 
href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> and I’m happy with the result. However, I realized that I didn’t specify a dependency which I needed (and obviously is used in the computation). <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/620572187841265675" target="_blank" rel="nofollow noopener noreferrer">Can I somehow fix it?</a><a href="#q-i-have-just-finished-a-lengthy-dvc-repro-and-im-happy-with-the-result-however-i-realized-that-i-didnt-specify-a-dependency-which-i-needed-and-obviously-is-used-in-the-computation-can-i-somehow-fix-it" aria-label="q i have just finished a lengthy dvc repro and im happy with the result however i realized that i didnt specify a dependency which i needed and obviously is used in the computation can i somehow fix it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can add the dependency to the stage file without rerunning/reproducing the stage. A rerun is not needed because the additional dependency hasn’t changed.</p> <p>You would need to edit the DVC-file. In the <code>deps</code> section add:</p> <div class="gatsby-highlight" data-language="yaml"><pre class="language-yaml"><code class="language-yaml"><span class="token punctuation">-</span> <span class="token key atrule">path</span><span class="token punctuation">:</span> not/included/file/path</code></pre></div> <p>and run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit file.dvc</code></a> to save the changes without running the pipeline again. 
See an example <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/620641530075414570" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="q-for-some-reason-we-need-to-always-specify-the-remote-name-when-doing-a-dvc-push-eg-dvc-push--r-upstream-as-opposed-to-dvc-push-mind-no-additional-arguments" style="position:relative;">Q: For some reason <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/629704961868955648" target="_blank" rel="nofollow noopener noreferrer">we need to always specify the remote name when doing a <code>dvc push</code></a> e.g., <a href="https://dvc.org/doc/command-reference/push#-r"><code>dvc push -r upstream</code></a> as opposed to <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> (mind no additional arguments).<a href="#q-for-some-reason-we-need-to-always-specify-the-remote-name-when-doing-a-dvc-push-eg-dvc-push--r-upstream-as-opposed-to-dvc-push-mind-no-additional-arguments" aria-label="q for some reason we need to always specify the remote name when doing a dvc push eg dvc push r upstream as opposed to dvc push mind no additional arguments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can mark a “default” remote:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> remote 
/path/to/my/main/remote</span></code></pre></div> <p>Then <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> (and other commands like <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>) will use the default remote.</p> <h3 id="q-if-i-want-stage-b-to-run-after-stage-a-but-the-stage-a-has-no-output-can-i-specify-as-dvc-file-as-bs-dependency" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/620715145374466048" target="_blank" rel="nofollow noopener noreferrer">If I want stage B to run after stage A, but the stage A has no output, can I specify A’s DVC-file as B’s dependency?</a><a href="#q-if-i-want-stage-b-to-run-after-stage-a-but-the-stage-a-has-no-output-can-i-specify-as-dvc-file-as-bs-dependency" aria-label="q if i want stage b to run after stage a but the stage a has no output can i specify as dvc file as bs dependency permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>No, at least at the time of publishing this. You could use a phony output though. For example, make stage A write some dummy file as its output and make B depend on that file. 
Please consider creating or upvoting a relevant issue on our GitHub if you’d like this to be implemented.</p> <h3 id="q-im-just-getting-started-with-dvc-but-id-like-to-use-it-for-multiple-developers-to-access-the-data-and-share-models-and-code-i-do-own-the-server-but-im-not-sure-how-to-use-dvc-with-ssh-remote" style="position:relative;">Q: I’m just getting started with DVC, but I’d like to use it for multiple developers to access the data and share models and code. <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/598867829785362452" target="_blank" rel="nofollow noopener noreferrer">I do own the server, but I’m not sure how to use DVC with SSH remote?</a><a href="#q-im-just-getting-started-with-dvc-but-id-like-to-use-it-for-multiple-developers-to-access-the-data-and-share-models-and-code-i-do-own-the-server-but-im-not-sure-how-to-use-dvc-with-ssh-remote" aria-label="q im just getting started with dvc but id like to use it for multiple developers to access the data and share models and code i do own the server but im not sure how to use dvc with ssh remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Please refer to <a href="https://discuss.dvc.org/t/how-do-i-use-dvc-with-ssh-remote/279/2" target="_blank" rel="nofollow noopener noreferrer">this answer</a> on the DVC forum and check the documentation for the <a href="https://dvc.org/doc/command-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer"><code>dvc remote add</code></a> and <a href="https://dvc.org/doc/command-reference/remote/modify" 
target="_blank" rel="nofollow noopener noreferrer"><code>dvc remote modify</code></a> commands to see more options and details.</p> <hr> <p>If you have any questions, concerns or ideas, let us know in the comments below or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too.</p>https://dvc.org/blog/october-19-dvc-heartbeathttps://dvc.org/blog/october-19-dvc-heartbeatTue, 05 Nov 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Autumn is a great season for new beginnings and there is so much we love about it this year. Here are some of the highlights:</p> <ul> <li> <p>Co-hosting our <a href="https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/" target="_blank" rel="nofollow noopener noreferrer">first ever meetup</a>! Our <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> partnered with <a href="https://www.linkedin.com/in/daniel-fischetti-4a6592bb/" target="_blank" rel="nofollow noopener noreferrer">Dan Fischetti</a> from <a href="https://twitter.com/standardAI" target="_blank" rel="nofollow noopener noreferrer">Standard Cognition</a> to discuss Open-source tools to version control Machine Learning models and experiments. 
The recording is available now here.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/RHQXK7EC0jI?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> </li> <li> <p><a href="https://blog.dataversioncontrol.com/dvc-org-for-hacktoberfest-2019-ce5320151a0c" target="_blank" rel="nofollow noopener noreferrer">Getting ready for the Hacktoberfest</a> and having the whole team get together to pick up and label nice issues and be ready to support the contributors.</p> </li> <li> <p>Discovering some really cool blogposts, talks and tutorials from our users all over the world: check <a href="https://blog.octo.com/mise-en-application-de-dvc-sur-un-projet-de-machine-learning/" target="_blank" rel="nofollow noopener noreferrer">this blogpost in French</a> or <a href="https://jupyter-tutorial.readthedocs.io/de/latest/productive/dvc/" target="_blank" rel="nofollow noopener noreferrer">this tutorial in German</a>!</p> </li> <li> <p>Having a great time working with a <a href="https://github.com/dashohoxha" target="_blank" rel="nofollow noopener noreferrer">tech writer</a> brought to us by the <a href="https://developers.google.com/season-of-docs" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a> program. 
Check out these <a href="https://dvc.org/doc/tutorials/interactive" target="_blank" rel="nofollow noopener noreferrer">interactive tutorials</a> we’ve created together.</p> </li> <li> <p>Having a heated internal discussion about Discord vs Slack support/community channels. If you are on the fence like us, have a look at <a href="https://internals.rust-lang.org/t/exploring-new-communication-channels/7859" target="_blank" rel="nofollow noopener noreferrer">this discussion</a> in the Rust community; it is very helpful.</p> </li> <li> <p>Seeing <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> being really happy one day:</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">.<a href="https://twitter.com/martinfowler">@martinfowler</a>'s books and his website were always the source of programming wisdom 💎 His Refactoring book is the first book I recommend to developers.<br><br>Now they write about ML lifecycle and automation. I’m especially excited because they use <a href="https://twitter.com/DVCorg">@DVCorg</a> that we’ve created. <a href="https://t.co/HwswZqjOsb">https://t.co/HwswZqjOsb</a></p>— Dmitry Petrov (@FullStackML) <a href="https://twitter.com/FullStackML/status/1169403554290814976">September 5, 2019</a></blockquote> </li> </ul> <hr> <p>We at <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC.org</a> are so happy every time we discover an article featuring DVC or addressing one of the burning ML issues we are trying to solve. 
Here are some of the links that caught our eye this past month:</p> <ul> <li><strong>Continuous Delivery for Machine Learning by <a href="https://twitter.com/dtsato" target="_blank" rel="nofollow noopener noreferrer">Danilo Sato</a>, <a href="https://twitter.com/arifwider" target="_blank" rel="nofollow noopener noreferrer">Arif Wider</a>, <a href="https://twitter.com/intellification" target="_blank" rel="nofollow noopener noreferrer">Christoph Windheuser</a> and curated by <a href="https://martinfowler.com/" target="_blank" rel="nofollow noopener noreferrer">Martin Fowler</a>.</strong></li> </ul> <blockquote> <p>As Machine Learning techniques continue to evolve and perform more complex tasks, so is evolving our knowledge of how to manage and deliver such applications to production. By bringing and extending the principles and practices from Continuous Delivery, we can better manage the risks of releasing changes to Machine Learning applications in a safe and reliable way.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://martinfowler.com/articles/cd4ml.html" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Continuous Delivery for Machine Learning</h4> <div class="elp-description">bio I am a consultant at ThoughtWorks Germany, where I am leading our data and machine learning activities. 
I enjoy…</div> <div class="elp-link">martinfowler.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-11-05/continuous-delivery-for-machine-learning-fd9ed27a4534371abdb90fcb1e5d1fb3.png" alt="Continuous Delivery for Machine Learning"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://medium.com/signaturit-tech-blog/the-path-to-identity-validation-2-3-4f698b2ffae9" target="_blank" rel="nofollow noopener noreferrer">The Path to Identity Validation</a> by <a href="https://medium.com/@victor.segura" target="_blank" rel="nofollow noopener noreferrer">Víctor Segura</a>.</strong></li> </ul> <blockquote> <p>So, the first question is clear: how to choose the optimal hardware for neural networks? Secondly, assuming that we have the appropriate infrastructure, how to build the machine learning ecosystem to train our models efficiently and not die trying? At <strong>Signaturit</strong>, we have the solution ;)</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/signaturit-tech-blog/the-path-to-identity-validation-2-3-4f698b2ffae9" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">The Path to Identity Validation (2/3)</h4> <div class="elp-description">How to start your own machine learning project?</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-11-05/the-path-to-identity-validation-51339e974d8ad0c70b28bffb4cacd674.jpeg" alt="The Path to Identity Validation (2/3)"> </div> </a> </section> <p></p> <ul> <li><strong>Talk: <a href="https://pretalx.com/pyconuk-2019/talk/GCLBFH/" target="_blank" rel="nofollow noopener noreferrer">Managing Big Data in Machine Learning projects</a> by <a href="https://twitter.com/vvasworld" target="_blank" rel="nofollow noopener noreferrer">V Vishnu Anirudh</a> at the <a href="https://2019.pyconuk.org/" target="_blank" 
rel="nofollow noopener noreferrer">Pycon UK 2019.</a></strong></li> </ul> <blockquote> <p>My talk will focus on Version Control Systems (VCS) for big-data projects. With the advent of Machine Learning (ML) , the development teams find it increasingly difficult to manage and collaborate on projects that deal with huge amounts of data and ML models apart from just source code.</p> </blockquote> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/4XpHk85_x0E?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <ul> <li><strong>Podcast: TWIML Talk #295 <a href="https://twimlai.com/twiml-talk-295-managing-deep-learning-experiments-with-lukas-biewald/" target="_blank" rel="nofollow noopener noreferrer">Managing Deep Learning Experiments</a> with <a href="https://twitter.com/l2k" target="_blank" rel="nofollow noopener noreferrer">Lukas Biewald</a></strong></li> </ul> <blockquote> <p>Seeing a need for reproducibility in deep learning experiments, Lukas founded Weights & Biases. In this episode we discuss his experiment tracking tool, how it works, the components that make it unique in the ML marketplace and the open, collaborative culture that Lukas promotes. 
Listen to Lukas delve into how he got his start in deep learning experiments, what his experiment tracking used to look like, the current Weights & Biases business success strategy, and what his team is working on today.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://twimlai.com/twiml-talk-295-managing-deep-learning-experiments-with-lukas-biewald/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Managing Deep Learning Experiments with Lukas Biewald — Talk #295</h4> <div class="elp-description">Today we are joined by Lukas Biewald, CEO and Co-Founder of Weights & Biases. Lukas, previously CEO and Founder of…</div> <div class="elp-link">twimlai.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-11-05/managing-deep-learning-experiments-cadd691f2bc6783395192d5944ad571a.jpeg" alt="Managing Deep Learning Experiments with Lukas Biewald — Talk #295"> </div> </a> </section> <p></p> <hr> <h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are lots of hidden gems in our Discord community discussions. 
Sometimes they are scattered all over the channels and hard to track down.</p> <p>We are sifting through the issues and discussions to share with you the most interesting takeaways.</p> <h3 id="q-ive-just-run-a-dvc-run-step-and-realised-i-forgot-to-declare-an-output-file-is-there-a-way-to-add-an-output-file-without-rerunning-the-computationally-expensive-stepstage" style="position:relative;">Q: I’ve just run a <code>dvc run</code> step, and realised I forgot to declare an output file. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/593743448020877323" target="_blank" rel="nofollow noopener noreferrer">Is there a way to add an output file without rerunning the (computationally expensive) step/stage?</a><a href="#q-ive-just-run-a-dvc-run-step-and-realised-i-forgot-to-declare-an-output-file-is-there-a-way-to-add-an-output-file-without-rerunning-the-computationally-expensive-stepstage" aria-label="q ive just run a dvc run step and realised i forgot to declare an output file is there a way to add an output file without rerunning the computationally expensive stepstage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you’ve already run it, you can open the created DVC-file in an editor and add an entry to the <code>outs</code> field. After that, just run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit my.dvc</code></a> and it will save the checksums and data without re-running your command. 
<code>dvc run --no-exec</code> followed by commit would also work, instead of modifying the DVC-file by hand.</p> <h3 id="q-for-metric-files-do-i-have-to-use-dvc-run-to-set-a-metric-or-can-i-do-it-some-other-way-can-i-use-metrics-functionality-without-the-need-to-setup-and-manage-dvc-cache-and-remote-storage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/593869598651318282" target="_blank" rel="nofollow noopener noreferrer">For metric files do I have to use dvc run to set a metric or can I do it some other way?</a> Can I use metrics functionality without the need to setup and manage DVC cache and remote storage?<a href="#q-for-metric-files-do-i-have-to-use-dvc-run-to-set-a-metric-or-can-i-do-it-some-other-way-can-i-use-metrics-functionality-without-the-need-to-setup-and-manage-dvc-cache-and-remote-storage" aria-label="q for metric files do i have to use dvc run to set a metric or can i do it some other way can i use metrics functionality without the need to setup and manage dvc cache and remote storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Any file that is under DVC control (e.g. added with <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> or an output in <code>dvc run -o</code>) can be made a metric file with <code>dvc metrics add file</code>. Alternatively, the command <code>dvc run -M file</code> makes <code>file</code> a metric without caching it. 
This means <code>dvc metrics show</code> can be used while the file is still versioned by Git.</p> <h3 id="q-is-there-a-way-not-to-add-the-full-azure-connection-string-to-the-dvcconfig-file-that-is-being-checked-into-git-for-using-dvc-remotes-i-think-its-quite-unhealthy-to-have-secrets-checked-in-scm" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/595586670498283520" target="_blank" rel="nofollow noopener noreferrer">Is there a way not to add the full (Azure) connection string to the .dvc/config file that is being checked into Git for using dvc remotes</a>? I think it’s quite unhealthy to have secrets checked in SCM.<a href="#q-is-there-a-way-not-to-add-the-full-azure-connection-string-to-the-dvcconfig-file-that-is-being-checked-into-git-for-using-dvc-remotes-i-think-its-quite-unhealthy-to-have-secrets-checked-in-scm" aria-label="q is there a way not to add the full azure connection string to the dvcconfig file that is being checked into git for using dvc remotes i think its quite unhealthy to have secrets checked in scm permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are two options: use the <code>AZURE_STORAGE_CONNECTION_STRING</code> environment variable, or use the <code>--local</code> flag, which puts the connection string into <code>.dvc/config.local</code>. That file is added to <code>.gitignore</code>, so Git doesn’t track it and your secrets are not exposed.</p> <h3 
id="q-i-would-like-to-know-if-it-is-possible-to-manage-files-under-dvc-whilst-keeping-them-in-their-original-locations-eg-on-a-network-drive-in-a-given-folder-structure-if-i-want-to-add-a-large-file-to-be-tracked-by-dvc-and-it-is-in-a-bucket-on-s3-or-gcs-can-i-do-that-without-downloading-it-locally" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/601068667131920385" target="_blank" rel="nofollow noopener noreferrer">I would like to know if it is possible to manage files under DVC whilst keeping them in their original locations (e.g. on a network drive in a given folder structure)</a>? <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/615278138896941101" target="_blank" rel="nofollow noopener noreferrer">If I want to add a large file to be tracked by DVC, and it is in a bucket on S3 or GCS, can I do that without downloading it locally?</a><a href="#q-i-would-like-to-know-if-it-is-possible-to-manage-files-under-dvc-whilst-keeping-them-in-their-original-locations-eg-on-a-network-drive-in-a-given-folder-structure-if-i-want-to-add-a-large-file-to-be-tracked-by-dvc-and-it-is-in-a-bucket-on-s3-or-gcs-can-i-do-that-without-downloading-it-locally" aria-label="q i would like to know if it is possible to manage files under dvc whilst keeping them in their original locations eg on a network drive in a given folder structure if i want to add a large file to be tracked by dvc and it is in a bucket on s3 or gcs can i do that without downloading it locally permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z"></path></svg> </a></h3> <p>Yes, you are probably looking for external dependencies and outputs. This is the <a href="https://dvc.org/doc/user-guide/managing-external-data" target="_blank" rel="nofollow noopener noreferrer">link</a> to the documentation to start.</p> <h3 id="q-how-do-i-setup-dvc-so-that-nas-eg-synology-acts-as-a-shared-dvc-cache" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/606388040377565215" target="_blank" rel="nofollow noopener noreferrer">How do I setup DVC so that NAS (e.g. Synology) acts as a shared DVC cache?</a><a href="#q-how-do-i-setup-dvc-so-that-nas-eg-synology-acts-as-a-shared-dvc-cache" aria-label="q how do i setup dvc so that nas eg synology acts as a shared dvc cache permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Using NAS (e.g. NFS) is a very common scenario for DVC. In short you use <a href="https://dvc.org/doc/command-reference/cache/dir"><code>dvc cache dir</code></a> to setup a cache externally. Set cache type to use symlinks and enable protected mode. 
We are preparing a <a href="https://github.com/iterative/dvc.org/blob/31c5d424c6530bb793af69c2af578d2b8a374d02/static/docs/use-cases/shared-storage-on-nfs.md" target="_blank" rel="nofollow noopener noreferrer">document</a> on how to set up NFS as a shared cache, but I think it can be applied to any NAS.</p> <h3 id="q-so-i-have-some-data-that-is-in-the-hundreds-of-gigs-if-i-enable-symlink-hardlink-strategy-and-cache-protecting-will-dvc-automatically-choose-this-strategy-over-copying-when-trying-to-use-dvc-add" style="position:relative;">Q: So I have some data that is in the hundreds of gigs. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/608013531010301952" target="_blank" rel="nofollow noopener noreferrer">If I enable symlink, hardlink strategy and cache protecting, will DVC automatically choose this strategy over copying when trying to use dvc add</a>?<a href="#q-so-i-have-some-data-that-is-in-the-hundreds-of-gigs-if-i-enable-symlink-hardlink-strategy-and-cache-protecting-will-dvc-automatically-choose-this-strategy-over-copying-when-trying-to-use-dvc-add" aria-label="q so i have some data that is in the hundreds of gigs if i enable symlink hardlink strategy and cache protecting will dvc automatically choose this strategy over copying when trying to use dvc add permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, it will! Here is some clarification. 
So when you set those settings like that, <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> data will move data to your cache and then will create a hardlink from your cache to your workspace.</p> <p>Unless your cache directory and your workspace are on different file systems, move should be instant. Please, find more information <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="q-my-repos-dvc-is-busy-and-locked-and-im-not-sure-how-it-got-that-way-and-how-to-removediagnose-the-lock-any-suggestions" style="position:relative;">Q: My repo’s DVC is “busy and locked” and I’m not sure how it got that way and how to remove/diagnose the lock. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/608392956679815168" target="_blank" rel="nofollow noopener noreferrer">Any suggestions?</a><a href="#q-my-repos-dvc-is-busy-and-locked-and-im-not-sure-how-it-got-that-way-and-how-to-removediagnose-the-lock-any-suggestions" aria-label="q my repos dvc is busy and locked and im not sure how it got that way and how to removediagnose the lock any suggestions permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC uses a lock file to prevent running two commands at the same time. The lock <a href="https://dvc.org/doc/user-guide/dvc-internals" target="_blank" rel="nofollow noopener noreferrer">file</a> is under the <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> directory. 
If no DVC commands are running and you are still getting this error, it’s safe to remove this file manually to resolve the issue.</p> <h3 id="q-im-trying-to-understand-how-does-dvc-remote-add-work-in-case-of-a-local-folder-and-what-is-the-best-workflow-when-data-is-outside-of-your-project-root" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/611209851757920266" target="_blank" rel="nofollow noopener noreferrer">I’m trying to understand how does DVC remote add work in case of a local folder and what is the best workflow when data is outside of your project root?</a><a href="#q-im-trying-to-understand-how-does-dvc-remote-add-work-in-case-of-a-local-folder-and-what-is-the-best-workflow-when-data-is-outside-of-your-project-root" aria-label="q im trying to understand how does dvc remote add work in case of a local folder and what is the best workflow when data is outside of your project root permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>When using DVC, in most cases we assume that your data will be somewhere under the project root. There is an option to use so-called <a href="https://dvc.org/doc/user-guide/managing-external-data" target="_blank" rel="nofollow noopener noreferrer">external dependencies</a>, meant for data that is usually too big to be stored under your project root, but if you operate on data of a reasonable size, I would recommend starting by putting the data somewhere under the project root. 
Remotes are usually places where you store your data, but it is DVC’s task to move your data around. If you want to keep your current setup, where the data lives in a different place than your project, you will need to refer to the data with full paths. For example:</p> <ol> <li> <p>You are in <code>/home/gabriel/myproject</code> and you have initialized DVC and Git repositories there.</p> </li> <li> <p>You have <code>featurize.py</code> in your project dir, and want to use the data to produce some features, and then use <code>train.py</code> to train a model.</p> </li> <li> <p>Run the command:</p> </li> </ol> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-d</span> /research_data/myproject/videos <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> /research_data/myproject/features <span class="token punctuation">\</span> python featurize.py</span></code></pre></div> <p>This tells DVC that you use <code>/research_data/myproject/videos</code> to featurize, and that the output goes to your features dir. Note that your code should be aware of those paths (they can be hardcoded inside <code>featurize.py</code>); the point of <code>dvc run</code> is just to tell DVC which artifacts belong to the currently defined step of the ML pipeline.</p> <h3 id="q-when-i-run-du-command-to-check-how-much-space-dvc-project-consumes-i-see-that-it-duplicatescopies-data-its-very-space-and-time-consuming-to-copy-large-data-files-is-there-a-way-to-avoid-that-it-takes-too-long-to-add-large-files-to-dvc" style="position:relative;">Q: When I run <code>du</code> command to check how much space DVC project consumes I see that it duplicates/copies data. 
<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/613935477896249364" target="_blank" rel="nofollow noopener noreferrer">It’s very space and time consuming to copy large data files, is there a way to avoid that?</a> It takes too long to add large files to DVC.<a href="#q-when-i-run-du-command-to-check-how-much-space-dvc-project-consumes-i-see-that-it-duplicatescopies-data-its-very-space-and-time-consuming-to-copy-large-data-files-is-there-a-way-to-avoid-that-it-takes-too-long-to-add-large-files-to-dvc" aria-label="q when i run du command to check how much space dvc project consumes i see that it duplicatescopies data its very space and time consuming to copy large data files is there a way to avoid that it takes too long to add large files to dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes! You don’t have to copy files with DVC. There are two reasons why <code>du</code> can show that data under DVC control takes double the space. First, <code>du</code> can be inaccurate when the underlying file system supports reflinks (XFS on Linux, APFS on Mac, etc.). This is actually the best scenario, since no copying is happening and no changes are required to any DVC settings. The second case is that copy semantics is used by default. It can be turned off by setting the cache type to <code>symlinks</code> or <code>hardlinks</code>. 
Please, read more on this <a href="https://dvc.org/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="q-how-can-i-detach-a-file-from-dvc-control" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/615479227189559323" target="_blank" rel="nofollow noopener noreferrer">How can I detach a file from DVC control?</a><a href="#q-how-can-i-detach-a-file-from-dvc-control" aria-label="q how can i detach a file from dvc control permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Just removing the corresponding DVC-file and running <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> after that should be enough. It’ll stop tracking the data file and clean the local cache that might still contain it. Note! Don’t forget to run <a href="https://dvc.org/doc/command-reference/unprotect"><code>dvc unprotect</code></a> if you use an advanced <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">DVC setup with symlinks and hardlinks</a> (the <code>cache.type</code> config option is not the default). If <a href="https://dvc.org/doc/command-reference/gc"><code>dvc gc</code></a> behavior is not granular enough, you can manually find the file by its checksum from the DVC-file in <code>.dvc/cache</code> and remote storage. 
Learn <a href="https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory" target="_blank" rel="nofollow noopener noreferrer">here</a> how they are organized.</p> <h3 id="q-im-trying-to-understand-if-dvc-is-an-appropriate-solution-for-storing-data-under-gdpr-requirements-that-means-that-permanent-deletion-of-files-with-sensitive-data-needs-to-be-fully-supported" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621057268145848340" target="_blank" rel="nofollow noopener noreferrer">I’m trying to understand if DVC is an appropriate solution for storing data under GDPR requirements.</a> That means that permanent deletion of files with sensitive data needs to be fully supported.<a href="#q-im-trying-to-understand-if-dvc-is-an-appropriate-solution-for-storing-data-under-gdpr-requirements-that-means-that-permanent-deletion-of-files-with-sensitive-data-needs-to-be-fully-supported" aria-label="q im trying to understand if dvc is an appropriate solution for storing data under gdpr requirements that means that permanent deletion of files with sensitive data needs to be fully supported permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, in this sense DVC is not very different from using bare S3, SSH or any other storage where you can go and just delete data. DVC adds a bit of overhead to locate a specific file to delete, but otherwise it’s all the same: you will be able to delete any file you want. 
See more details on how you can retrospectively edit directories under DVC control <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/621062105524862987" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <hr> <p>If you have any questions, concerns or ideas, let us know in the comments below or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too.</p>https://dvc.org/blog/dvc-org-for-hacktoberfest-2019https://dvc.org/blog/dvc-org-for-hacktoberfest-2019Tue, 08 Oct 2019 00:00:00 GMT<p><a href="https://hacktoberfest.digitalocean.com/" target="_blank" rel="nofollow noopener noreferrer">Hacktoberfest</a> is a month-long program that celebrates open source and encourages you to contribute to open source projects (and rewards you with stickers and a cool T-shirt!). Whether you’re a seasoned contributor or looking for projects to contribute to for the first time, you’re welcome to participate!</p> <p>It is the 6th season of Hacktoberfest and the 2nd year of participation for the DVC.org team. 
We really enjoyed it in 2018 and this year we are upping the game with our own cool stickers, special edition T-shirts and a <a href="https://github.com/iterative/dvc/labels/hacktoberfest" target="_blank" rel="nofollow noopener noreferrer">collection of carefully picked tickets</a>.</p> <h3 id="how-to-participate" style="position:relative;">How to participate?<a href="#how-to-participate" aria-label="how to participate permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you haven’t started your Hacktoberfest challenge yet, it is just the right time, you have 3 weeks left to submit PRs and get your swag! Here are some important details:</p> <ul> <li> <p>Hacktoberfest is open to everyone in the global community.</p> </li> <li> <p>You can sign up anytime between October 1 and October 31. 
Make sure to sign up on the <a href="https://hacktoberfest.digitalocean.com/" target="_blank" rel="nofollow noopener noreferrer">official Hacktoberfest website</a> for your PRs to count.</p> </li> <li> <p>To get a shirt, you must make 4 legit pull requests (PRs) between October 1–31 in any time zone.</p> </li> <li> <p>Pull requests can be made in any public GitHub-hosted repositories/projects, not just the ones highlighted.</p> </li> </ul> <p>And the special addition from DVC.org team:</p> <ul> <li> <p>Look through the list of <a href="https://github.com/iterative/dvc/labels/hacktoberfest" target="_blank" rel="nofollow noopener noreferrer">DVC Hacktoberfest tickets</a> or the list of <a href="https://github.com/iterative/dvc/labels/good%20first%20issue" target="_blank" rel="nofollow noopener noreferrer">good DVC first issues</a>.</p> </li> <li> <p>Make a PR to DVC and get our stickers.</p> </li> <li> <p>Close three issues for DVC and get a special DVC T-shirt.</p> </li> </ul> <h3 id="why-contribute-to-dvc" style="position:relative;">Why contribute to DVC?<a href="#why-contribute-to-dvc" aria-label="why contribute to dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> (Data Version Control) is a relatively young open source project. It was started in late 2017 by a data scientist and an engineer to fill in the gaps in the ML processes tooling. 
Nowadays DVC is growing pretty fast and though our in-house team is quite small, we have to thank our contributors (more than 100 in both code and docs) for developing DVC with us.</p> <p>DVC is participating in Hacktoberfest for the 2nd year in a row to bring more people into open source, to learn from them and to give back by sharing our own experience. This year we decided to focus on a single topic that is important for us: improving UI/UX.</p> <p>As our contributors and maintainers were sifting through the feature requests, bugs, and improvements to create a good <a href="https://github.com/iterative/dvc/labels/hacktoberfest" target="_blank" rel="nofollow noopener noreferrer">list of Hacktoberfest tickets</a>, we noticed that the UI/UX label on GitHub kept popping up again and again. DVC is a command line tool, and improving UI/UX in our case means making decisions on how to name command options, where and when to use <a href="https://github.com/iterative/dvc/issues/2498" target="_blank" rel="nofollow noopener noreferrer">confirmation prompts</a> and/or where to abort execution, what exactly the user would expect to see in the output, how to test it later, etc.</p> <p>Why does improving UI/UX appear to be so important for DVC at this stage? Perhaps because the project is more mature now and we are ready to spend more time on polishing it. Or maybe because it is still too engineering-focused and we used to disregard/de-prioritize all this ‘fancy’ stuff. 
Or it is because we just lack experience in creating good CLI UI/UX!</p> <p>One or another, those are great reasons to focus on improving UI (in a broader sense than just GUI), improving docs, creating powerful consistent experience for our users and increasing accessibility of DVC.</p> <p>That’s how <a href="https://devcenter.heroku.com/articles/cli-style-guide" target="_blank" rel="nofollow noopener noreferrer">Heroku’s CLI style guide</a> starts:</p> <blockquote> <p>Heroku CLI plugins should provide a clear user experience, targeted primarily for human readability and usability, which delights the user, while at the same time supporting advanced users and output formats. This article provides a clear direction for designing delightful CLI plugins.</p> </blockquote> <p>At DVC we are building user experience in line with these principles too, but we also have our own challenges. And here we turn for help to the global open source community and all the contributors out there.</p> <p>For all of us who have a heart for open source — let’s discuss, contribute, learn, take the technologies forward and build something great together!</p> <p>Happy hacking!</p> <hr> <p>We are happy to hear from you <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. 
Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too!</p>https://dvc.org/blog/september-19-dvc-heartbeathttps://dvc.org/blog/september-19-dvc-heartbeatThu, 26 Sep 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We are super excited to co-host our very first <strong><a href="https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/" target="_blank" rel="nofollow noopener noreferrer">meetup in San Francisco on October 10</a></strong>! We will gather at the brand new Dropbox HQ office at 6:30 pm to discuss open-source tools to version control ML models and experiments. <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> is teaming up with <a href="https://www.linkedin.com/in/daniel-fischetti-4a6592bb/" target="_blank" rel="nofollow noopener noreferrer">Daniel Fischetti</a> from <a href="https://standard.ai/" target="_blank" rel="nofollow noopener noreferrer">Standard Cognition</a> to discuss best ML practices. 
Join us and save your spot now:</p> <p> </p><section class="elp-content-holder"> <a href="https://www.meetup.com/San-Francisco-Machine-Learning-Meetup/events/264846847/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Open-source tools to version control Machine Learning models and experiments</h4> <div class="elp-description">AI and ML are becoming an essential part of the engineering and data science everyday workflow. ML teams need new tools…</div> <div class="elp-link">meetup.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/open-source-tools-to-version-control-9fbecd80e325857bc75eecce526311b8.png" alt="Open-source tools to version control Machine Learning models and experiments"> </div> </a> </section> <p></p> <p>If you are not in SF on this date and happen to be in Europe — don’t miss the PyCon DE & PyData Berlin 2019 joint event on October 9–11. We cannot make it to Berlin this year, but we were thrilled to discover 2 independent talks featuring DVC by <a href="https://pyvideo.org/pydata-berlin-2019/version-control-for-data-science.html" target="_blank" rel="nofollow noopener noreferrer">Alessia Marcolini</a> and <a href="https://pyvideo.org/pydata-berlin-2019/tools-that-help-you-get-your-experiments-under-control.html" target="_blank" rel="nofollow noopener noreferrer">Katharina Rasch</a>.</p> <p>Some other highlights of the end of summer:</p> <ul> <li> <p>Our users and contributors keep creating fantastic pieces of content around DVC (sharing some links below, but it’s only a fraction of what we have in stock — can’t be more happy and humbled about it!).</p> </li> <li> <p>We’ve reached 79 contributors to <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC core project</a> and 74 contributors to <a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC 
documentation</a> (and have something special in mind to celebrate our 100th contributors).</p> </li> <li> <p>We enjoyed working with all the talented <a href="https://developers.google.com/season-of-docs/" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a> applicants and are now moving to the next stage with our chosen tech writer <a href="http://dashohoxha.fs.al/" target="_blank" rel="nofollow noopener noreferrer">Dashamir Hoxha</a>.</p> </li> <li> <p>We’ve crossed the 3,000 stars mark on GitHub (<a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">over 3,500 now</a>). Thank you for your support!</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr"><a href="https://t.co/vhkN3zWzjT">https://t.co/vhkN3zWzjT</a> just hit 3000 stars on <a href="https://twitter.com/hashtag/Github?src=hash&ref_src=twsrc%5Etfw">#Github</a>! <a href="https://t.co/AILppwghuu">https://t.co/AILppwghuu</a> <br>Thank you for your trust, your contributions and your insights🤝<br>We are beyond happy to have you with us on this exciting journey🚀 <a href="https://t.co/dwokD2v7t7">pic.twitter.com/dwokD2v7t7</a></p>— 🦉DVC (@DVCorg) <a href="https://twitter.com/DVCorg/status/1147220439472545793">July 5, 2019</a></blockquote> </li> <li> <p>We’ve had a great time at the <a href="https://events.linuxfoundation.org/events/open-source-summit-north-america-2019/program/" target="_blank" rel="nofollow noopener noreferrer">Open Source Summit</a> by the Linux Foundation in San Diego: speaking on stage, running a booth, and chatting with all the amazing open-source crowd out there.</p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">Love all <a href="https://twitter.com/DVCorg">@DVCorg</a> booth buzz at <a href="https://twitter.com/hashtag/OSSummit?src=hash&ref_src=twsrc%5Etfw">#OSSummit</a>! 
🎉<br>Stop by and grab some cool swag 🌈and participate in our easy fun contest to win a Jetson Nano, the coolest fuzzy owls and a bunch of other staff! 🤩 <a href="https://t.co/MIzfilhrRJ">pic.twitter.com/MIzfilhrRJ</a></p>— Sveta Grinchenko 🇺🇦 (@a142hr) <a href="https://twitter.com/a142hr/status/1164256520235675648">August 21, 2019</a></blockquote> </li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/ccbbea0b26a9ac64744739bf7a5ee8b5/03346/open-source-summit-by-linux-foundation.jpg" alt="open source summit by linux foundation" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <hr> <p>Here are some of the great pieces of content around DVC and MLOps that we discovered in July and August:</p> <ul> <li> <p><strong>Great insightful discussion on Twitter about versioning ML projects started by <a href="https://medium.com/@NathanBenaich" target="_blank" rel="nofollow noopener noreferrer">Nathan Benaich</a>.</strong></p> <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">🙏Question to ML friends: How do you go about version control for your ML projects (data, models, and intermediate steps in your data pipelines)? Have you built your own tools? Are using something open source? Or a SaaS? Or does this come bundled with your ML infra products? 
Thx!</p>— Nathan Benaich (@nathanbenaich) <a href="https://twitter.com/nathanbenaich/status/1151815916512010242">July 18, 2019</a></blockquote> </li> <li> <p><strong><a href="https://medium.com/ixorthink/our-machine-learning-workflow-dvc-mlflow-and-training-in-docker-containers-5b9c80cdf804" target="_blank" rel="nofollow noopener noreferrer">Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers</a> by <a href="https://medium.com/@ward.vanlaer" target="_blank" rel="nofollow noopener noreferrer">Ward Van Laer</a>.</strong></p> </li> </ul> <blockquote> <p>It is possible to manage your work flow using open-source and free tools.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/ixorthink/our-machine-learning-workflow-dvc-mlflow-and-training-in-docker-containers-5b9c80cdf804" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers</h4> <div class="elp-description">Googling for machine learning frameworks to version data, track python models etc.. 
I was surprised to see that these…</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/our-machine-learning-workflow-356399465e0f6c05c8d759fbc3be264a.jpeg" alt="Our Machine Learning Workflow: DVC, MLFlow and Training in Docker Containers"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://medium.com/qonto-engineering/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe" target="_blank" rel="nofollow noopener noreferrer">Using DVC to create an efficient version control system for data projects</a> by <a href="https://medium.com/@basile_16101" target="_blank" rel="nofollow noopener noreferrer">Basile Guerrapin</a>.</strong></li> </ul> <blockquote> <p>DVC brought versioning for inputs, intermediate files and algorithm models to the VAT auto-detection project and this drastically increased our <strong>productivity</strong>.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/qonto-engineering/using-dvc-to-create-an-efficient-version-control-system-for-data-projects-96efd94355fe" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Using DVC to create an efficient version control system for data projects</h4> <div class="elp-description">At first we were looking for a tool to help us dealing with production data files such as trained machine learning…</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/using-dvc-to-create-an-efficient-vcs-377e67b47c660bf412f0284e01c46d16.png" alt="Using DVC to create an efficient version control system for data projects"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://techsparx.com/software-development/ai/dvc/versioning-example.html" target="_blank" rel="nofollow noopener noreferrer">Managing versioned machine learning 
datasets in DVC, and easily share ML projects with colleagues</a> by <a href="https://twitter.com/7genblogger" target="_blank" rel="nofollow noopener noreferrer">David Herron</a>.</strong></li> </ul> <blockquote> <p>In this tutorial we will go over a simple image classifier. We will learn how DVC works in a machine learning project, how it optimizes reproducing results when the project is changed, and how to share the project with colleagues.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://techsparx.com/software-development/ai/dvc/versioning-example.html" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Managing versioned machine learning datasets in DVC, and easily share ML projects with colleagues</h4> <div class="elp-description">Software Development Artificial Intelligence Data Version Control (DVC) Managing versioned machine learning datasets in…</div> <div class="elp-link">techsparx.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/managing-versioned-machine-learning-datasets-c82b4558da0d1197f2ecafe10ae88e5b.jpeg" alt="Managing versioned machine learning datasets in DVC, and easily share ML projects with colleagues"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://towardsdatascience.com/how-to-use-data-version-control-dvc-in-a-machine-learning-project-a78245c0185" target="_blank" rel="nofollow noopener noreferrer">How to use data version control (dvc) in a machine learning project</a> by <a href="https://towardsdatascience.com/@matthiasbitzer94" target="_blank" rel="nofollow noopener noreferrer">Matthias Bitzer</a>.</strong></li> </ul> <blockquote> <p>To illustrate the use of dvc in a machine learning context, we assume that our data is divided into train, test and validation folders by default, with the amount of data increasing over time either through an active learning cycle or by manually 
adding new data.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/how-to-use-data-version-control-dvc-in-a-machine-learning-project-a78245c0185" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">How to use data version control (dvc) in a machine learning project</h4> <div class="elp-description">When working in a productive machine learning project you probably deal with a tone of data and several models. To keep…</div> <div class="elp-link">towardsdatascience.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/how-to-use-data-version-control-0e75fc4c1aaa64b0466ba4235a189f56.jpeg" alt="How to use data version control (dvc) in a machine learning project"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://towardsdatascience.com/version-control-ml-model-4adb2db5f87c" target="_blank" rel="nofollow noopener noreferrer">Version Control ML Model</a> by <a href="https://towardsdatascience.com/@TianchenW" target="_blank" rel="nofollow noopener noreferrer">Tianchen Wu</a></strong></li> </ul> <blockquote> <p>This post presents a solution to version control machine learning models with git and dvc (<a href="https://dvc.org/doc/tutorial" target="_blank" rel="nofollow noopener noreferrer">Data Version Control</a>).</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/version-control-ml-model-4adb2db5f87c" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Version Control ML Model</h4> <div class="elp-description">Machine Learning operations (let’s call it MLOps under the current buzzword pattern xxOps) are quite different from…</div> <div class="elp-link">towardsdatascience.com</div> </div> <div class="elp-image-holder"> <img 
src="https://dvc.org/2019-09-26/version-control-ml-model-d95a668bbc2b17aaf3cb8795e510d604.png" alt="Version Control ML Model"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://dev.to/robogeek/reflinks-vs-symlinks-vs-hard-links-and-how-they-can-help-machine-learning-projects-1cj4" target="_blank" rel="nofollow noopener noreferrer">Reflinks vs symlinks vs hard links, and how they can help machine learning projects</a> by <a href="https://medium.com/@7genblogger" target="_blank" rel="nofollow noopener noreferrer">David Herron</a></strong></li> </ul> <blockquote> <p>In this blog post we’ll go over the details of using links, some cool new stuff in modern file systems (reflinks), and an example of how DVC (Data Version Control, <a href="https://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">https://dvc.org/</a>) leverages this.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://dev.to/robogeek/reflinks-vs-symlinks-vs-hard-links-and-how-they-can-help-machine-learning-projects-1cj4" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Reflinks vs symlinks vs hard links, and how they can help machine learning projects</h4> <div class="elp-description">Hard links and symbolic links have been available since time immemorial, and we use them all the time without even…</div> <div class="elp-link">dev.to</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/reflinks-vs-symlinks-vs-hard-links-b95aa2004eda8198752604ff86f8321c.jpeg" alt="Reflinks vs symlinks vs hard links, and how they can help machine learning projects"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://blog.codecentric.de/en/2019/08/dvc-dependency-management/" target="_blank" rel="nofollow noopener noreferrer">DVC dependency management — a guide</a> by <a href="https://blog.codecentric.de/en/author/bert-besser/" target="_blank" rel="nofollow noopener noreferrer">Bert Besser</a> 
and <a href="https://blog.codecentric.de/en/author/veronika-schindler/" target="_blank" rel="nofollow noopener noreferrer">Veronika Schwan</a>.</strong></li> </ul> <blockquote> <p>This post is a follow-up to <a href="https://blog.codecentric.de/en/2019/03/walkthrough-dvc/" target="_blank" rel="nofollow noopener noreferrer">A walkthrough of DVC</a> that deals with managing dependencies between DVC projects. In particular, this follow-up is about importing specific versions of an artifact (e.g. a trained model or a dataset) from one DVC project into another.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://blog.codecentric.de/en/2019/08/dvc-dependency-management/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">DVC dependency management - a guide - codecentric AG Blog</h4> <div class="elp-description">This post is a follow-up to A walkthrough of DVC that deals with managing dependencies between DVC projects. 
In…</div> <div class="elp-link">blog.codecentric.de</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/dvc-org-a5af4abb87a983796d837b5df9b4f382.png" alt="DVC dependency management - a guide - codecentric AG Blog"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://medium.com/@czeslaw.szubert/effective-ml-teams-lessons-learned-6a6e761bc283" target="_blank" rel="nofollow noopener noreferrer">Effective ML Teams — Lessons Learned</a> by <a href="https://medium.com/@czeslaw.szubert" target="_blank" rel="nofollow noopener noreferrer">Czeslaw Szubert</a></strong></li> </ul> <blockquote> <p>In this post I’ll present lessons learned on how to setup successful ML teams and what you need to devise an effective enterprise ML strategy.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/@czeslaw.szubert/effective-ml-teams-lessons-learned-6a6e761bc283" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Effective ML Teams — Lessons Learned</h4> <div class="elp-description">Machine Learning and Artificial Intelligence has entered our everyday lives — from Virtual Assistants built into each…</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/effective-ml-teams-7a63cfae30c573559b85ebf714981f26.jpeg" alt="Effective ML Teams — Lessons Learned"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://www.esentri.com/lessons-learned-from-training-a-german-speech-recognition-model/" target="_blank" rel="nofollow noopener noreferrer">Lessons learned from training a German Speech Recognition model</a> by <a href="https://www.linkedin.com/in/dschoenleber/" target="_blank" rel="nofollow noopener noreferrer">David Schönleber</a>.</strong></li> </ul> <blockquote> <p>Setting up a documentation-by-design workflow and using appropriate tools where needed, e.g. 
<em>MLFlow</em> and <em>dvc,</em> can be a real deal-breaker.</p> </blockquote> <p> </p><section class="elp-content-holder"> <a href="https://www.esentri.com/lessons-learned-from-training-a-german-speech-recognition-model/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Lessons Learned from Training a German Speech Recognition Model - esentri AG</h4> <div class="elp-description">This post is the first of a two-part series. In this first part, I address learnings from a recent project in which I…</div> <div class="elp-link">esentri.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-09-26/lessons-learned-from-training-47361fe6a8b0e428d267d1cdf745c431.jpeg" alt="Lessons Learned from Training a German Speech Recognition Model - esentri AG"> </div> </a> </section> <p></p> <hr> <h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are lots of hidden gems in our Discord community discussions. 
Sometimes they are scattered all over the channels and hard to track down.</p> <p>We sift through the issues and discussions and share the most interesting takeaways with you.</p> <h3 id="q-im-getting-an-error-message-while-trying-to-use-aws-s3-storage-error-failed-to-push-data-to-the-cloud--unable-to-locate-credentials-any-ideas-whats-happening" style="position:relative;">Q: I’m getting an error message while trying to use AWS S3 storage: <code>ERROR: failed to push data to the cloud — Unable to locate credentials.</code> <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/587792932061577218" target="_blank" rel="nofollow noopener noreferrer">Any ideas what’s happening?</a><a href="#q-im-getting-an-error-message-while-trying-to-use-aws-s3-storage-error-failed-to-push-data-to-the-cloud--unable-to-locate-credentials-any-ideas-whats-happening" aria-label="q im getting an error message while trying to use aws s3 storage error failed to push data to the cloud unable to locate credentials any ideas whats happening permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Most likely you haven’t configured your S3 credentials/AWS account yet. Please read the full documentation on the AWS website. 
The short version of what should be done is the following:</p> <ul> <li> <p><a href="https://portal.aws.amazon.com/gp/aws/developer/registration/index.html" target="_blank" rel="nofollow noopener noreferrer">Create your AWS account.</a></p> </li> <li> <p>Log in to your AWS Management Console.</p> </li> <li> <p>Click on your user name at the top right of the page.</p> </li> <li> <p>Click on the Security Credentials link from the drop-down menu.</p> </li> <li> <p>Find the Access Credentials section, and copy the latest <code>Access Key ID</code>.</p> </li> <li> <p>Click on the Show link in the same row, and copy the <code>Secret Access Key</code>.</p> </li> </ul> <p>Follow <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html" target="_blank" rel="nofollow noopener noreferrer">this link</a> to set up your environment.</p> <h3 id="q-i-added-data-with-dvc-add-or-dvc-run-and-see-that-it-takes-twice-what-it-was-before-with-du-command-does-it-mean-that-dvc-copies-data-that-is-added-under-its-control-how-do-i-prevent-this-from-happening" style="position:relative;">Q: I added data with <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> or <code>dvc run</code> and see that it takes twice what it was before (with <code>du</code> command). <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/595402051203235861" target="_blank" rel="nofollow noopener noreferrer">Does it mean that DVC copies data that is added under its control? 
How do I prevent this from happening?</a><a href="#q-i-added-data-with-dvc-add-or-dvc-run-and-see-that-it-takes-twice-what-it-was-before-with-du-command-does-it-mean-that-dvc-copies-data-that-is-added-under-its-control-how-do-i-prevent-this-from-happening" aria-label="q i added data with dvc add or dvc run and see that it takes twice what it was before with du command does it mean that dvc copies data that is added under its control how do i prevent this from happening permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>To give a short summary — by default, DVC copies the files from your working directory to the cache (this is for safety reasons, it is better to duplicate the data). If you have reflinks (copy-on-write) enabled on your file system, DVC will use that method — which is as safe as copying. You can also configure DVC to use hardlinks/symlinks to save some space and time, but it will require enabling the protected mode (making data files in workspace read-only). Read more details <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="q-how-concurrent-friendly-is-the-cache-and-different-remotes-is-it-safe-to-have-several-containersnodes-fill-the-same-cache-at-the-same-time" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/599345778703597568" target="_blank" rel="nofollow noopener noreferrer">How concurrent-friendly is the cache? And different remotes? 
Is it safe to have several containers/nodes fill the same cache at the same time?</a><a href="#q-how-concurrent-friendly-is-the-cache-and-different-remotes-is-it-safe-to-have-several-containersnodes-fill-the-same-cache-at-the-same-time" aria-label="q how concurrent friendly is the cache and different remotes is it safe to have several containersnodes fill the same cache at the same time permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>It is safe and a very common use case for DVC to have a shared cache. Please, check <a href="https://discuss.dvc.org/t/share-nas-data-in-server/180/12" target="_blank" rel="nofollow noopener noreferrer">this thread</a>, for example.</p> <h3 id="qwhat-is-the-proper-way-to-exit-the-ascii-visualization-when-you-run-dvc-pipeline-show-command" style="position:relative;">Q:<a href="https://discordapp.com/channels/485586884165107732/563406153334128681/603890677176336394" target="_blank" rel="nofollow noopener noreferrer">What is the proper way to exit the ASCII visualization?</a> (when you run <code>dvc pipeline show</code> command).<a href="#qwhat-is-the-proper-way-to-exit-the-ascii-visualization-when-you-run-dvc-pipeline-show-command" aria-label="qwhat is the proper way to exit the ascii visualization when you run dvc pipeline show command permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 
1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>See this <a href="https://dvc.org/doc/commands-reference/pipeline/show#options" target="_blank" rel="nofollow noopener noreferrer">document</a>. To navigate, use arrows or W, A, S, D keys. To exit, press Q.</p> <h3 id="q-is-there-an-issue-if-i-set-my-caches3-external-cache-to-my-default-remote-i-dont-quite-understand-what-an-external-cache-is-for-other-than-i-have-to-have-it-for-external-outputs" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/606197026488844338" target="_blank" rel="nofollow noopener noreferrer">Is there an issue if I set my <code>cache.s3</code> external cache to my default remote?</a> I don’t quite understand what an external cache is for other than I have to have it for external outputs.<a href="#q-is-there-an-issue-if-i-set-my-caches3-external-cache-to-my-default-remote-i-dont-quite-understand-what-an-external-cache-is-for-other-than-i-have-to-have-it-for-external-outputs" aria-label="q is there an issue if i set my caches3 external cache to my default remote i dont quite understand what an external cache is for other than i have to have it for external outputs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The short answer is that we suggest keeping them separate to avoid possible checksum overlaps. 
Checksums on S3 might theoretically overlap with our checksums (with the content of the file being different), so it could be dangerous. The chances of losing data are pretty slim, but we would not risk it. Right now, we are working on making sure there are no possible overlaps.</p> <h3 id="q-whats-the-right-procedure-to-move-a-step-dvc-file-around-the-project" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/606425815139221504" target="_blank" rel="nofollow noopener noreferrer">What’s the right procedure to move a step .dvc file around the project?</a><a href="#q-whats-the-right-procedure-to-move-a-step-dvc-file-around-the-project" aria-label="q whats the right procedure to move a step dvc file around the project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Assuming the file was created with <code>dvc run</code>, there are a few possible ways. The obvious one is to delete the file and create a new one with <code>dvc run --no-exec -f file/path/and/name.dvc</code>. Another possibility is to rename/move the file and then edit it manually. See <a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">this document</a> that describes how DVC-files are organized. 
No matter what method you use, you can run <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit file.dvc</code></a> to save changes without running the command again.</p> <h3 id="q-dvc-status-doesnt-seem-to-report-things-that-need-to-be-dvc-pushed-is-that-by-design" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/606917839688957952" target="_blank" rel="nofollow noopener noreferrer"><code>dvc status</code> doesn’t seem to report things that need to be dvc pushed, is that by design?</a><a href="#q-dvc-status-doesnt-seem-to-report-things-that-need-to-be-dvc-pushed-is-that-by-design" aria-label="q dvc status doesnt seem to report things that need to be dvc pushed is that by design permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Try <code>dvc status --cloud</code> or <a href="https://dvc.org/doc/command-reference/status#--remote"><code>dvc status --remote &lt;your-remote&gt;</code></a> to compare your local cache with a remote one. By default, it only compares the “working directory” with your local cache (to check whether something should be reproduced and saved or not).</p> <h3 id="q-what-kind-of-files-can-you-put-into-dvc-metrics" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/608701494035873792" target="_blank" rel="nofollow noopener noreferrer">What kind of files can you put into <code>dvc metrics</code>?</a><a href="#q-what-kind-of-files-can-you-put-into-dvc-metrics" aria-label="q what kind of files 
can you put into dvc metrics permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The file can be in any format; <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> will try to interpret the format and output it in the best possible way. Also, if you are using <code>csv</code> or <code>json</code>, you can use the <code>--xpath</code> flag to query specific measurements. <strong>In general, you can make any file a metric file and put any content into it; DVC is not opinionated about it.</strong> Usually, though, these are files that measure the performance/accuracy of your model and capture the configuration of experiments. The idea is to use <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> to display all your metrics across experiments so you can make decisions about which combination (of features, parameters, algorithms, architecture, etc.) 
works the best.</p> <h3 id="q-does-dvc-take-into-account-the-timestamp-of-a-file-or-is-the-md5-only-depends-on-the-files-actualbits-content" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/613639458000207902" target="_blank" rel="nofollow noopener noreferrer">Does DVC take into account the timestamp of a file or is the MD5 only depends on the files actual/bits content?</a><a href="#q-does-dvc-take-into-account-the-timestamp-of-a-file-or-is-the-md5-only-depends-on-the-files-actualbits-content" aria-label="q does dvc take into account the timestamp of a file or is the md5 only depends on the files actualbits content permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC takes into account only content (bits) of a file to calculate hashes that are saved into DVC-files.</p> <h3 id="q-similar-to-dvc-gc-is-there-a-command-to-garbage-collect-from-the-remote" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/616421757808541721" target="_blank" rel="nofollow noopener noreferrer">Similar to <code>dvc gc</code> is there a command to garbage collect from the remote?</a><a href="#q-similar-to-dvc-gc-is-there-a-command-to-garbage-collect-from-the-remote" aria-label="q similar to dvc gc is there a command to garbage collect from the remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 
1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://dvc.org/doc/command-reference/gc#--remote"><code>dvc gc --remote NAME</code></a> does this, but you should be extra careful, because it will remove everything that is not currently “in use” (by the working directory). Also, please check this <a href="https://github.com/iterative/dvc/issues/2325" target="_blank" rel="nofollow noopener noreferrer">issue</a>, since the semantics of this command might have changed by the time you read this.</p> <h3 id="q-how-do-i-use-and-configure-remote-storage-on-ibm-cloud-object-storage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/591237578209099786" target="_blank" rel="nofollow noopener noreferrer">How do I use and configure remote storage on IBM Cloud Object Storage?</a><a href="#q-how-do-i-use-and-configure-remote-storage-on-ibm-cloud-object-storage" aria-label="q how do i use and configure remote storage on ibm cloud object storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Since it’s S3 compatible, specifying <code>endpointurl</code> (exact URL depends on the <a href="https://cloud.ibm.com/docs/services/cloud-object-storage?topic=cloud-object-storage-endpoints" target="_blank" rel="nofollow noopener noreferrer">region</a>) is the way to go:</p> <div class="gatsby-highlight" 
data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> <span class="token parameter variable">-d</span> mybucket s3://path/to/dir </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> mybucket <span class="token punctuation">\</span> endpointurl <span class="token punctuation">\</span> https://s3.eu.cloud-object-storage.appdomain.cloud</span></code></pre></div> <h3 id="q-how-can-i-push-data-from-client-to-google-cloud-bucket-using-dvc-just-want-to-know-how-can-i-set-the-credentials" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/592958360903483403" target="_blank" rel="nofollow noopener noreferrer">How can I push data from client to google cloud bucket using DVC?</a>. Just want to know how can i set the credentials.<a href="#q-how-can-i-push-data-from-client-to-google-cloud-bucket-using-dvc-just-want-to-know-how-can-i-set-the-credentials" aria-label="q how can i push data from client to google cloud bucket using dvc just want to know how can i set the credentials permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can do it by setting an environment variable that points to your credentials path, like:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">export</span> <span 
class="token assign-left variable">GOOGLE_APPLICATION_CREDENTIALS</span><span class="token operator">=</span>path/to/credentials</span></code></pre></div> <p>It is also possible to set this variable via <a href="https://dvc.org/doc/command-reference/config"><code>dvc config</code></a>:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> myremote credentialpath /path/to/my/creds</span></code></pre></div> <p>where <code>myremote</code> is your remote name.</p> <hr> <p>If you have any questions, concerns or ideas, let us know in the comments below or connect with DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too.</p>https://dvc.org/blog/july-19-dvc-heartbeathttps://dvc.org/blog/july-19-dvc-heartbeatThu, 01 Aug 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>As we continue to grow DVC together with our fantastic contributors, we enjoy more and more insights, discussions, and articles either created or brought to us by our community. We feel it is the right time to start sharing more of your news, your stories and your discoveries. 
New Heartbeat is here!</p> <p>Speaking of our own news — next month DVC team is going to the <a href="https://events.linuxfoundation.org/events/open-source-summit-north-america-2019/" target="_blank" rel="nofollow noopener noreferrer">Open Source North America Summit</a>. It is taking place in San Diego on August 21–23. <a href="https://ossna19.sched.com/speaker/dmitry35" target="_blank" rel="nofollow noopener noreferrer">Dmitry</a> and <a href="https://ossna19.sched.com/speaker/svetlanagrinchenko" target="_blank" rel="nofollow noopener noreferrer">Sveta</a> will be giving talks and we will run a booth. So looking forward to it! Stop by for a chat and some cool swag. And if you are in San Diego on those days and want to catch up — please let us know <a href="http://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> or on Twitter!</p> <p> </p><section class="elp-content-holder"> <a href="https://ossna19.sched.com/event/PUVv/open-source-tools-for-ml-experiments-management-dmitry-petrov-ruslan-kuprieiev-iterative-ai" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Open Source Summit + ELC North America 2019: Open Source Tools for ML Experiments Man...</h4> <div class="elp-description">Speakers Software Engineer, Iterative AI Ruslan is a Software Engineer at Iterative AI. 
Previously he worked on live…</div> <div class="elp-link">ossna19.sched.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-08-01/open-source-north-america-summit-fc282755298bb8aa0dd4feb0d7fad084.png" alt="Open Source Summit + ELC North America 2019: Open Source Tools for ML Experiments Man..."> </div> </a> </section> <p></p> <p> </p><section class="elp-content-holder"> <a href="https://ossna19.sched.com/event/PWNk/speaker-preparation-simple-steps-with-a-tremendous-impact-svetlana-grinchenko-dvcorg" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Open Source Summit + ELC North America 2019: Speaker Preparation: Simple Steps with a...</h4> <div class="elp-description">Speakers Head of Developer Relations, DVC.org Svetlana is driving developer relations and community at DVC.org…</div> <div class="elp-link">ossna19.sched.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-08-01/open-source-north-america-summit-fc282755298bb8aa0dd4feb0d7fad084.png" alt="Open Source Summit + ELC North America 2019: Speaker Preparation: Simple Steps with a..."> </div> </a> </section> <p></p> <p>Every month our team is excited to discover new great pieces of content addressing some of the burning ML issues. 
Here are some of the links that caught our eye in June:</p> <ul> <li><strong><a href="https://dev.to/robogeek/principled-machine-learning-4eho" target="_blank" rel="nofollow noopener noreferrer">Principled Machine Learning: Practices and Tools for Efficient Collaboration</a> by <a href="https://medium.com/@7genblogger" target="_blank" rel="nofollow noopener noreferrer">David Herron</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://dev.to/robogeek/principled-machine-learning-4eho" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Principled Machine Learning: Practices and Tools for Efficient Collaboration</h4> <div class="elp-description">Machine learning projects are often harder than they should be. The code to train an ML model is just software, and we…</div> <div class="elp-link">dev.to</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-08-01/principled-machine-learning-20fa6eb01b0a36da8cdc0d97d2f96cc2.jpeg" alt="Principled Machine Learning: Practices and Tools for Efficient Collaboration"> </div> </a> </section> <p></p> <blockquote> <p>As we’ve seen in this article some tools and practices can be borrowed from regular software engineering. 
However, the needs of machine learning projects dictate tools that better fit the purpose.</p> </blockquote> <ul> <li><strong>First <a href="http://ml-repa.ru/" target="_blank" rel="nofollow noopener noreferrer">ML-REPA</a> <a href="http://ml-repa.ru/page6697700.html" target="_blank" rel="nofollow noopener noreferrer">Meetup: Reproducible ML experiments</a> hosted by <a href="https://dgtl.raiffeisen.ru/" target="_blank" rel="nofollow noopener noreferrer">Raiffeisen DGTL</a>. Check out the video and slide decks.</strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="http://ml-repa.ru/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Machine Learning REPA</h4> <div class="elp-description">Announcements of events, projects, tool reviews, and case studies about ML projects, experiment management, automation and…</div> <div class="elp-link">ml-repa.ru</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-08-01/machine-learning-repa-56a858376395041b5fa650963e76fac1.png" alt="Machine Learning REPA"> </div> </a> </section> <p></p> <p><a href="http://ml-repa.ru/" target="_blank" rel="nofollow noopener noreferrer">ML-REPA</a> is a fantastic new resource for Russian-speaking folks interested in Reproducibility, Experiments and Pipelines Automation. 
Curated by <a href="https://twitter.com/mnrozhkov" target="_blank" rel="nofollow noopener noreferrer">Mikhail Rozhkov</a> and highly recommended by our team.</p> <h3 id="how-do-you-manage-your-machine-learning-experiments-discussion-on-reddit-is-full-of-insights" style="position:relative;"><a href="https://www.reddit.com/r/MachineLearning/comments/bx0apm/d_how_do_you_manage_your_machine_learning/" target="_blank" rel="nofollow noopener noreferrer">How do you manage your machine learning experiments?</a> discussion on Reddit is full of insights.<a href="#how-do-you-manage-your-machine-learning-experiments-discussion-on-reddit-is-full-of-insights" aria-label="how do you manage your machine learning experiments discussion on reddit is full of insights permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <blockquote class="reddit-card" data-card-created="1576789144"><a href="https://www.reddit.com/r/MachineLearning/comments/bx0apm/d_how_do_you_manage_your_machine_learning/">[D] How do you manage your machine learning experiments?</a> from <a href="http://www.reddit.com/r/MachineLearning">r/MachineLearning</a></blockquote> <hr> <h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 
0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.</p> <p>We sift through the issues and discussions and share the most interesting takeaways with you.</p> <h3 id="q-i-have-within-one-git-repository-different-folders-with-very-different-content-basically-different-projects-or-content-i-want-to-have-different-permissions-to-and-i-thought-about-using-different-buckets-in-aws-as-remotes-im-not-sure-if-its-possible-with-dvc-to-store-some-files-in-some-remote-and-some-other-files-in-some-other-remote-is-it" style="position:relative;">Q: I have within one git repository different folders with very different content (basically different projects, or content I want to have different permissions to), and I thought about using different buckets in AWS as remotes. 
<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/575718048330416158" target="_blank" rel="nofollow noopener noreferrer">I’m not sure if it’s possible with DVC to store some files in some remote, and some other files in some other remote, is it?</a><a href="#q-i-have-within-one-git-repository-different-folders-with-very-different-content-basically-different-projects-or-content-i-want-to-have-different-permissions-to-and-i-thought-about-using-different-buckets-in-aws-as-remotes-im-not-sure-if-its-possible-with-dvc-to-store-some-files-in-some-remote-and-some-other-files-in-some-other-remote-is-it" aria-label="q i have within one git repository different folders with very different content basically different projects or content i want to have different permissions to and i thought about using different buckets in aws as remotes im not sure if its possible with dvc to store some files in some remote and some other files in some other remote is it permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can definitely add more than one remote (see <a href="https://dvc.org/doc/commands-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer">dvc remote add</a>), and <a href="https://dvc.org/doc/commands-reference/push" target="_blank" rel="nofollow noopener noreferrer">dvc push</a> has a <code>-r</code> option to pick which remote to send the cached data files (deps, outs, etc.) to. We would not recommend doing this though. 
It complicates the commands you have to run — you will need to remember to specify a remote name for every command that deals with data — <code>push</code>, <code>pull</code>, <code>gc</code>, <code>fetch</code>, <code>status</code>, etc. Please, leave a comment in the relevant issue <a href="https://github.com/iterative/dvc/issues/2095" target="_blank" rel="nofollow noopener noreferrer">here</a> if this case is important for you.</p> <h3 id="q-is-that-possible-with-dvc-to-have-multiple-few-metric-files-and-compare-them-all-at-once-for-example-wed-like-to-consider-as-metrics-the-loss-of-a-neural-network-training-process-loss-as-a--m-output-of-a-training-stage-and-also-apart-knowing-the-accuracy-of-the-nn-on-a-test-set-another--m-output-of-eval-stage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/578532350221352987" target="_blank" rel="nofollow noopener noreferrer">Is that possible with DVC to have multiple (few) metric files and compare them all at once?</a> For example, we’d like to consider as metrics the loss of a neural network training process (loss as a <code>-M</code> output of a training stage), and also apart knowing the accuracy of the NN on a test set (another <code>-M</code> output of eval stage).<a href="#q-is-that-possible-with-dvc-to-have-multiple-few-metric-files-and-compare-them-all-at-once-for-example-wed-like-to-consider-as-metrics-the-loss-of-a-neural-network-training-process-loss-as-a--m-output-of-a-training-stage-and-also-apart-knowing-the-accuracy-of-the-nn-on-a-test-set-another--m-output-of-eval-stage" aria-label="q is that possible with dvc to have multiple few metric files and compare them all at once for example wed like to consider as metrics the loss of a neural network training process loss as a m output of a training stage and also apart knowing the accuracy of the nn on a test set another m output of eval stage permalink" class="anchor after"><svg aria-hidden="true" 
height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, it is totally fine to use <code>-M</code> in different stages. <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> will just show both metrics.</p> <h3 id="q-i-have-a-scenario-where-an-artifacts-data-folder-is-created-by-the-dvc-run-command-via-the--o-flag-i-have-manually-added-another-file-into-or-modified-the-artifacts-folder-but-when-i-do-dvc-push-nothing-happens-is-there-anyway-around-this" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/577362750443880449" target="_blank" rel="nofollow noopener noreferrer">I have a scenario where an artifacts (data) folder is created by the dvc run command via the <code>-o</code> flag. 
I have manually added another file into or modified the artifacts folder but when I do <code>dvc push</code> nothing happens, is there anyway around this?</a><a href="#q-i-have-a-scenario-where-an-artifacts-data-folder-is-created-by-the-dvc-run-command-via-the--o-flag-i-have-manually-added-another-file-into-or-modified-the-artifacts-folder-but-when-i-do-dvc-push-nothing-happens-is-there-anyway-around-this" aria-label="q i have a scenario where an artifacts data folder is created by the dvc run command via the o flag i have manually added another file into or modified the artifacts folder but when i do dvc push nothing happens is there anyway around this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Let’s first do a quick recap on how DVC handles data files (you can definitely find more information on the <a href="http://dvc.org/docs" target="_blank" rel="nofollow noopener noreferrer">DVC documentation site</a>).</p> <ul> <li> <p>When you do <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, <code>dvc run</code> or <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> DVC puts artifacts (in case of <code>dvc run</code> artifacts == outputs produced by the command) into <code>.dvc/cache</code> directory (default cache location). 
You don’t see this happening because <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">DVC keeps links</a> (or in certain cases creates a copy) to these files/directories.</p> </li> <li> <p><a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> does not move files from the workspace (what you see) to the remote storage; it always moves files/directories that are already in the cache (default is <code>.dvc/cache</code>).</p> </li> <li> <p>So, now you’ve added a file manually, or made some other modifications. But these files are not in the cache yet. The analogy would be <code>git commit</code>. You change the file, you do <code>git commit</code>, and only after that can you push something to the Git server (GitHub, GitLab, etc.). The difference is that DVC does the commit (moves files to the cache) automatically in certain cases (<a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>, <code>dvc run</code>, etc.).</p> </li> </ul> <p>There is an explicit command, <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a>, that you should run if you want to enforce the change to the output produced by <code>dvc run</code>. This command will update the corresponding DVC-files (<code>.dvc</code> extension) and will move the data to the cache. After that you should be able to run <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> to save your data on the external storage.</p> <p>Note, when you do an explicit commit like this you are potentially “breaking” the reproducibility. 
That is, there is no longer any guarantee that your directory can be produced by <code>dvc run</code>/<a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, since you changed it manually.</p> <h3 id="q-id-like-to-transform-my-dataset-in-place-to-avoid-copying-it-but-i-cant-use-dvc-run-to-do-this-because-it-doesnt-allow-the-same-directory-as-an-output-and-a-dependency" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/578898899469729796" target="_blank" rel="nofollow noopener noreferrer">I’d like to transform my dataset in-place to avoid copying it, but I can’t use <code>dvc run</code> to do this because it doesn’t allow the same directory as an output and a dependency.</a><a href="#q-id-like-to-transform-my-dataset-in-place-to-avoid-copying-it-but-i-cant-use-dvc-run-to-do-this-because-it-doesnt-allow-the-same-directory-as-an-output-and-a-dependency" aria-label="q id like to transform my dataset in place to avoid copying it but i cant use dvc run to do this because it doesnt allow the same directory as an output and a dependency permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You could do this in a single stage that both gets your data and modifies it. That way the stage doesn’t depend on the data folder. 
You could just depend on your download + modifying script.</p> <h3 id="q-can-anyone-tell-me-what-this-error-message-is-about-to-avoid-unpredictable-behavior-rerun-command-with-non-overlapping-outs-paths" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/579283950778712076" target="_blank" rel="nofollow noopener noreferrer">Can anyone tell me what this error message is about?</a> “To avoid unpredictable behavior, rerun command with non overlapping outs paths.”<a href="#q-can-anyone-tell-me-what-this-error-message-is-about-to-avoid-unpredictable-behavior-rerun-command-with-non-overlapping-outs-paths" aria-label="q can anyone tell me what this error message is about to avoid unpredictable behavior rerun command with non overlapping outs paths permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Most likely it means that there is a DVC-file that has the same output twice, or that there are two DVC-files that share the same output file.</p> <h3 id="q-im-getting-no-such-file-or-directory-error-when-i-do-dvc-run-or-dvc-repro-the-command-runs-find-if-i-dont-use-dvc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/580176327701823498" target="_blank" rel="nofollow noopener noreferrer">I’m getting “No such file or directory” error when I do <code>dvc run</code> or <code>dvc repro</code></a>. 
The command runs fine if I don’t use DVC.<a href="#q-im-getting-no-such-file-or-directory-error-when-i-do-dvc-run-or-dvc-repro-the-command-runs-find-if-i-dont-use-dvc" aria-label="q im getting no such file or directory error when i do dvc run or dvc repro the command runs find if i dont use dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>That happens because <code>dvc run</code> tries to ensure that your command is the one creating your output, so it removes existing outputs before executing the command. This way, when you run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> later, it will be able to fully reproduce the output. So you need to make the script itself create the directory or file.</p> <h3 id="q-im-implementing-a-cicd-and-i-would-like-to-simplify-my-cicd-or-even-my-training-code-keeping-them-cloud-agnostic-by-using-dvc-pull-inside-my-docker-container-when-initializing-a-training-job--can-dvc-be-used-in-this-way" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/581256265234251776" target="_blank" rel="nofollow noopener noreferrer">I’m implementing a CI/CD and I would like to simplify my CI/CD or even my training code (keeping them cloud agnostic) by using <code>dvc pull</code> inside my Docker container when initializing a training job. 
</a> Can DVC be used in this way?<a href="#q-im-implementing-a-cicd-and-i-would-like-to-simplify-my-cicd-or-even-my-training-code-keeping-them-cloud-agnostic-by-using-dvc-pull-inside-my-docker-container-when-initializing-a-training-job--can-dvc-be-used-in-this-way" aria-label="q im implementing a cicd and i would like to simplify my cicd or even my training code keeping them cloud agnostic by using dvc pull inside my docker container when initializing a training job can dvc be used in this way permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, it’s definitely a valid case for DVC. There are different ways of organizing the storage that training machines are using to access data. 
These range from the very simple (using a local storage volume and pulling data from the remote storage every time) to using NAS or EFS to store a shared DVC cache.</p> <h3 id="q-i-was-able-to-follow-the-getting-started-examples-however-now-i-am-trying-to-push-my-data-to-github-i-keep-getting-the-following-error-error-failed-to-push-data-to-the-cloud--upload-is-not-supported-by-https-remote" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/598866528984891403" target="_blank" rel="nofollow noopener noreferrer">I was able to follow the getting started examples, however now I am trying to push my data to Github, I keep getting the following error: “ERROR: failed to push data to the cloud — upload is not supported by https remote”.</a><a href="#q-i-was-able-to-follow-the-getting-started-examples-however-now-i-am-trying-to-push-my-data-to-github-i-keep-getting-the-following-error-error-failed-to-push-data-to-the-cloud--upload-is-not-supported-by-https-remote" aria-label="q i was able to follow the getting started examples however now i am trying to push my data to github i keep getting the following error error failed to push data to the cloud upload is not supported by https remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>HTTP remotes do not support upload yet. The example Get Started repository uses HTTP to keep it read-only and to abstract the actual storage provider we are using internally. 
If you actually check the remote URL, you should see that it is an S3 bucket and AWS provides an HTTP end-point to read data from it.</p> <h3 id="q-im-looking-to-configure-aws-s3-as-a-storage-for-dvc-ive-set-up-the-remotes-and-initialized-dvc-in-the-git-repository-i-tried-testing-it-by-pushing-a-dataset-in-the-form-of-an-excel-file-the-command-completed-without-any-issues-but-this-is-what-im-seeing-in-s3-dvc-seems-to-have-created-a-subdirectory-in-the-intended-directory-called-35-where-it-placed-this-file-with-a-strange-name" style="position:relative;">Q: I’m looking to configure AWS S3 as a storage for DVC. I’ve set up the remotes and initialized dvc in the git repository. I tried testing it by pushing a dataset in the form of an excel file. The command completed without any issues but this is what I’m seeing in S3. <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/585967551708921856" target="_blank" rel="nofollow noopener noreferrer">DVC seems to have created a subdirectory in the intended directory called “35” where it placed this file with a strange name.</a><a href="#q-im-looking-to-configure-aws-s3-as-a-storage-for-dvc-ive-set-up-the-remotes-and-initialized-dvc-in-the-git-repository-i-tried-testing-it-by-pushing-a-dataset-in-the-form-of-an-excel-file-the-command-completed-without-any-issues-but-this-is-what-im-seeing-in-s3-dvc-seems-to-have-created-a-subdirectory-in-the-intended-directory-called-35-where-it-placed-this-file-with-a-strange-name" aria-label="q im looking to configure aws s3 as a storage for dvc ive set up the remotes and initialized dvc in the git repository i tried testing it by pushing a dataset in the form of an excel file the command completed without any issues but this is what im seeing in s3 dvc seems to have created a subdirectory in the intended directory called 35 where it placed this file with a strange name permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path 
fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is not an issue; it is an implementation detail. There’s currently no way to upload the files with their original filenames (in this case, the S3 bucket will have the file <code>data.csv</code> but under another name, <code>20/893143…</code>). The reason behind this decision is that we want to store a file only once, no matter how many dataset versions it’s used in. Also, it’s a reliable way to uniquely identify the file. You don’t have to worry that someone will create a file with the same name (path) but different content.</p> <h3 id="q-is-it-possible-to-only-have-a-shared-local-cache-and-no-remote-im-trying-to-figure-out-how-to-use-this-in-a-40-node-cluster-which-already-has-very-fast-nfs-storage-across-all-the-nodes-not-storing-everything-twice-seems-desirable-esp-for-the-multi-tb-input-data" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/587730054893666326" target="_blank" rel="nofollow noopener noreferrer">Is it possible to only have a shared ‘local’ cache and no remote?</a> I’m trying to figure out how to use this in a 40 node cluster which already has very fast NFS storage across all the nodes. Not storing everything twice seems desirable. Esp. 
for the multi-TB input data<a href="#q-is-it-possible-to-only-have-a-shared-local-cache-and-no-remote-im-trying-to-figure-out-how-to-use-this-in-a-40-node-cluster-which-already-has-very-fast-nfs-storage-across-all-the-nodes-not-storing-everything-twice-seems-desirable-esp-for-the-multi-tb-input-data" aria-label="q is it possible to only have a shared local cache and no remote im trying to figure out how to use this in a 40 node cluster which already has very fast nfs storage across all the nodes not storing everything twice seems desirable esp for the multi tb input data permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes, and it’s actually a very common use case. All you need to do is use the <code>dvc cache dir</code> command to set up an external cache. There are a few caveats, though. Please read <a href="https://discuss.dvc.org/t/share-nas-data-in-server/180/4?u=shcheklein" target="_blank" rel="nofollow noopener noreferrer">this link</a> for an example of the workflow.</p> <hr> <p>If you have any questions, concerns or ideas, let us know in the comments below or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. 
Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are always open, too.</p>https://dvc.org/blog/june-19-dvc-heartbeathttps://dvc.org/blog/june-19-dvc-heartbeatWed, 26 Jun 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We want to start by saying to our users, contributors, and community members how grateful we are for the fantastic work you are doing contributing to DVC, giving talks about DVC, sharing your feedback, use cases and your concerns. A huge thank you to each of you from the DVC team!</p> <p>We would love to give back and support any positive initiative around DVC — just let us know <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> and we will send you a bunch of cool swag, connect to a tech expert or find another way to support your project. 
Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are open, too.</p> <p><strong>And if you have 4 minutes to spare, we are conducting our first <a href="https://docs.google.com/forms/d/1tmn8YHLUkeSi5AIq4DGJi28iZy9HTazl6DWKe3Hxpnc/edit?ts=5cfc47c2" target="_blank" rel="nofollow noopener noreferrer">DVC user survey</a> and would love to hear from you!</strong></p> <p>Aside from admiring great DVC-related content from our users, we have one more reason to particularly enjoy the past month: the DVC team went to Cleveland to attend <a href="https://us.pycon.org/2019/about/" target="_blank" rel="nofollow noopener noreferrer">PyCon 2019</a> and it was a blast!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/b123f78f23b67bb29be863d7452154a3/03346/cleveland-to-attend-pycon-2019.jpg" alt="cleveland to attend pycon 2019" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Amazing <a href="https://github.com/sureL" target="_blank" rel="nofollow noopener noreferrer">Jennifer</a> and her artwork for our <a href="https://twitter.com/hashtag/SupportOpenSource" target="_blank" rel="nofollow noopener noreferrer">SupportOpenSource</a> contest</em></p> <p>We had it all. Running our first ever conference booth, leading an impromptu unconference discussion and arranging some cool <a href="https://twitter.com/hashtag/SupportOpenSource?src=hashtag_click" target="_blank" rel="nofollow noopener noreferrer">#SupportOpenSource</a> activities were great! Last-minute accommodation cancellations, booth equipment delivery issues, and being late for our very own talk were not so great. 
We will be sharing more about it in a separate blog post soon.</p> <div class="yt-embed-wrapper"><iframe width="100%" height="315" src="https://www.youtube-nocookie.com/embed/jkfh2PM5Sz8?rel=0&&showinfo=0;" frameborder="0" allow="autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe><div class="yt-embed-wrapper__overlay"><span class="yt-embed-wrapper__tooltip">By clicking play, you agree to YouTube's <a href="https://policies.google.com/u/3/privacy?hl=en" target="_blank" rel="nofollow noopener noreferrer">Privacy Policy</a> and <a href="https://www.youtube.com/static?template=terms" target="_blank" rel="nofollow noopener noreferrer">Terms of Service</a></span></div></div> <p>Here are <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a>’s PyCon <a href="https://www.youtube.com/watch?v=jkfh2PM5Sz8" target="_blank" rel="nofollow noopener noreferrer">talk</a> and <a href="https://docs.google.com/presentation/d/1CYt0w8WoZAXiQEtVDVDsTnQumzdZx91v32MwEK20R-E/edit" target="_blank" rel="nofollow noopener noreferrer">slides</a> on machine learning model and dataset versioning practices.</p> <p>We absolutely loved being at PyCon and can’t wait for our next conference!</p> <hr> <p>Our team is so happy every time we discover an article featuring DVC or addressing one of the burning ML issues we are trying to solve. Here are some of the links that caught our eye this past month:</p> <ul> <li><strong><a href="https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4" target="_blank" rel="nofollow noopener noreferrer">The Rise of DataOps (from the ashes of Data Governance)</a> by <a href="https://towardsdatascience.com/@ryanwgross" target="_blank" rel="nofollow noopener noreferrer">Ryan Gross</a>.</strong></li> </ul> <p>A brilliant, comprehensive read on current data management issues. It might be the best article we have ever read on this subject. 
Every word strongly resonates with our vision and ideas behind DVC. Highly recommended by DVC team!</p> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">The Rise of DataOps (from the ashes of Data Governance)</h4> <div class="elp-description">Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline to drive…</div> <div class="elp-link">towardsdatascience.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-06-26/the-rise-of-data-ops-1966d840bf0394acafc57223d40c26d2.png" alt="The Rise of DataOps (from the ashes of Data Governance)"> </div> </a> </section> <p></p> <blockquote> <p>Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline. At the end of the transformation, data governance will look a lot more like DevOps, with data stewards, scientists, and engineers working closely together to codify the governance policies.</p> </blockquote> <ul> <li><strong><a href="https://medium.com/@christopher.samiullah/first-impressions-of-data-science-version-control-dvc-fe96ab29cdda" target="_blank" rel="nofollow noopener noreferrer">First Impressions of Data Science Version Control (DVC)</a> by <a href="https://christophergs.github.io/" target="_blank" rel="nofollow noopener noreferrer">Christopher Samiullah</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://medium.com/@christopher.samiullah/first-impressions-of-data-science-version-control-dvc-fe96ab29cdda" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">First Impressions of Data Science Version Control (DVC)</h4> <div class="elp-description">A Powerful New Machine Learning 
Tool</div> <div class="elp-link">medium.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-06-26/first-impressions-of-data-science-version-control-e96f9af1cebb895e023e79e4de6eb0f3.png" alt="First Impressions of Data Science Version Control (DVC)"> </div> </a> </section> <p></p> <blockquote> <p>In 2019, we tend to find organizations using a mix of git, Makefiles, ad hoc scripts and reference files to try and achieve reproducibility. DVC enters this mix offering a cleaner solution, specifically targeting Data Science challenges.</p> </blockquote> <ul> <li><strong><a href="https://github.com/peopledoc/mlvtools-tutorial" target="_blank" rel="nofollow noopener noreferrer">Versioning and Reproducibility with MLV-tools and DVC</a>: <a href="https://peopledoc.github.io/mlvtools-tutorial/talks/pyData/presentation.html#/" target="_blank" rel="nofollow noopener noreferrer">Talk</a> and <a href="https://peopledoc.github.io/mlvtools-tutorial/talks/workshop/presentation.html#/" target="_blank" rel="nofollow noopener noreferrer">Tutorial</a> by <a href="https://github.com/sbracaloni" target="_blank" rel="nofollow noopener noreferrer">Stéphanie Bracaloni</a> and <a href="https://github.com/SdgJlbl" target="_blank" rel="nofollow noopener noreferrer">Sarah Diot-Girard</a>.</strong></li> </ul> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/72397df92519affe8d30d67d72539d3f/39600/versioning-and-reproducibility-with-mlv-tools.png" alt="versioning and reproducibility with mlv tools" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <ul> <li><strong><a href="https://www.oreilly.com/ideas/becoming-a-machine-learning-company-means-investing-in-foundational-technologies" target="_blank" rel="nofollow noopener noreferrer">Becoming a machine learning company means 
investing in foundational technologies</a> by <a href="https://www.oreilly.com/people/4e7ad-ben-lorica" target="_blank" rel="nofollow noopener noreferrer">Ben Lorica</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://www.oreilly.com/ideas/becoming-a-machine-learning-company-means-investing-in-foundational-technologies" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Becoming a machine learning company means investing in foundational technologies</h4> <div class="elp-description">Get expert knowledge on the tools and technologies you need to put your data strategies to work. Join us at the…</div> <div class="elp-link">oreilly.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-06-26/becoming-a-machine-learning-company-307760aa3e556f62ddc35f90eec73eed.jpeg" alt="Becoming a machine learning company means investing in foundational technologies"> </div> </a> </section> <p></p> <blockquote> <p>With an eye toward the growing importance of machine learning, we recently completed <a href="https://www.oreilly.com/data/free/evolving-data-infrastructure.csp" target="_blank" rel="nofollow noopener noreferrer">a data infrastructure survey</a> that drew more than 3,200 respondents.</p> </blockquote> <hr> <h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are lots of hidden gems in our Discord community 
discussions. Sometimes they are scattered all over the channels and hard to track down.</p> <p>We are sifting through the issues and discussions and sharing the most interesting takeaways with you.</p> <h3 id="q-does-dvc-support-azure-data-lake-gen1" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/575655655629651968" target="_blank" rel="nofollow noopener noreferrer">Does DVC support Azure Data Lake Gen1?</a><a href="#q-does-dvc-support-azure-data-lake-gen1" aria-label="q does dvc support azure data lake gen1 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Azure Data Lake is HDFS-compatible, and DVC supports HDFS remotes. 
Give it a try and let us know if you hit any problems <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="q-an-excellent-discussion-on-versioning-tabular-sql-data-do-you-know-of-any-tools-that-deal-better-with-sql-specific-versioning" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/575681811401801748" target="_blank" rel="nofollow noopener noreferrer">An excellent discussion on versioning tabular (SQL) data.</a> Do you know of any tools that deal better with SQL-specific versioning?<a href="#q-an-excellent-discussion-on-versioning-tabular-sql-data-do-you-know-of-any-tools-that-deal-better-with-sql-specific-versioning" aria-label="q an excellent discussion on versioning tabular sql data do you know of any tools that deal better with sql specific versioning permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>It’s a wide topic. The actual solution might depend on a specific scenario and what exactly needs to be versioned. DVC does not provide any special functionality on top of databases to version their content.</p> <p>Depending on your use case, our recommendation would be to run SQL and pull the result file (CSV/TSV file?) that then can be used to do analysis. This file can be taken under DVC control. 
Alternatively, in certain cases the source files (the ones used to populate the databases) can be placed under DVC control, so that we can keep versions of them or track incoming updates.</p> <p>Read the <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/575681811401801748" target="_blank" rel="nofollow noopener noreferrer">discussion</a> to learn more.</p> <h3 id="q-how-does-dvc-do-the-versioning-between-binary-files-is-there-a-binary-diff-similar-to-git-or-is-every-version-stored-distinctly-in-full" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/575686711821205504" target="_blank" rel="nofollow noopener noreferrer">How does DVC do the versioning between binary files?</a> Is there a binary diff, similar to git? Or is every version stored distinctly in full?<a href="#q-how-does-dvc-do-the-versioning-between-binary-files-is-there-a-binary-diff-similar-to-git-or-is-every-version-stored-distinctly-in-full" aria-label="q how does dvc do the versioning between binary files is there a binary diff similar to git or is every version stored distinctly in full permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC simply saves every file as is; we don’t use binary diffs right now. 
A directory won’t be duplicated in full, though (for example, if you add just a few files to a directory of 10M files), since we treat every file inside as a separate entity.</p> <h3 id="q-is-there-a-way-to-pass-parameters-from-eg-dvc-repro-to-stages" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/576160840701575169" target="_blank" rel="nofollow noopener noreferrer">Is there a way to pass parameters from e.g. <code>dvc repro</code> to stages?</a><a href="#q-is-there-a-way-to-pass-parameters-from-eg-dvc-repro-to-stages" aria-label="q is there a way to pass parameters from eg dvc repro to stages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The simplest option is to create a config file (JSON or whatnot) that your scripts read and your stages depend on.</p> <h3 id="q-what-is-the-best-way-to-get-cached-output-files-from-different-branches-simultaneously-for-example-cached-tensorboard-files-from-different-branches-to-compare-experiments" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/577852740034625576" target="_blank" rel="nofollow noopener noreferrer">What is the best way to get cached output files from different branches simultaneously?</a> For example, cached tensorboard files from different branches to compare experiments.<a href="#q-what-is-the-best-way-to-get-cached-output-files-from-different-branches-simultaneously-for-example-cached-tensorboard-files-from-different-branches-to-compare-experiments" aria-label="q what is the 
best way to get cached output files from different branches simultaneously for example cached tensorboard files from different branches to compare experiments permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There is a way to do that through our (still not officially released) API pretty easily. Here is an <a href="https://cdn.discordapp.com/attachments/563406153334128681/577894682722304030/dvc_get_output_files.py" target="_blank" rel="nofollow noopener noreferrer">example script</a> showing how it could be done.</p> <h3 id="q-docker-and-dvc-to-being-able-to-pushpull-data-we-need-to-run-a-git-clone-to-get-dvc-files-and-remote-definitions--but-we-worry-that-would-make-the-container-quite-heavy-since-it-contains-our-entire-project-history" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/563406153334128681/583949033685516299" target="_blank" rel="nofollow noopener noreferrer">Docker and DVC.</a> To be able to push/pull data we need to run a git clone to get DVC-files and remote definitions — but we worry that would make the container quite heavy (since it contains our entire project history).<a href="#q-docker-and-dvc-to-being-able-to-pushpull-data-we-need-to-run-a-git-clone-to-get-dvc-files-and-remote-definitions--but-we-worry-that-would-make-the-container-quite-heavy-since-it-contains-our-entire-project-history" aria-label="q docker and dvc to being able to pushpull data we need to run a git clone to get dvc files and remote definitions but we worry that would make the container quite heavy since it contains our 
entire project history permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>You can do <code>git clone --depth 1</code>, which will not download any history except the latest commit.</p> <h3 id="q-after-dvc-pushing-the-same-file-it-creates-multiple-copies-of-the-same-file-is-that-how-its-supposed-to-work" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/574133734136086559" target="_blank" rel="nofollow noopener noreferrer">After DVC pushing the same file, it creates multiple copies of the same file. Is that how it’s supposed to work?</a><a href="#q-after-dvc-pushing-the-same-file-it-creates-multiple-copies-of-the-same-file-is-that-how-its-supposed-to-work" aria-label="q after dvc pushing the same file it creates multiple copies of the same file is that how its supposed to work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you are pushing the same file, there are no copies pushed or saved in the cache. DVC uses checksums to identify files, so if you add the same file once again, it will detect that the file is already in the local cache and won’t copy it there again. 
The same goes for <code>dvc push</code>: if it sees that a cache file with that checksum already exists on your remote, it won’t upload it again.</p> <h3 id="q-how-do-i-uninstall-dvc-on-mac-installed-via-pkg-installer" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/574941227624169492" target="_blank" rel="nofollow noopener noreferrer">How do I uninstall DVC on Mac (installed via <code>pkg</code> installer)?</a><a href="#q-how-do-i-uninstall-dvc-on-mac-installed-via-pkg-installer" aria-label="q how do i uninstall dvc on mac installed via pkg installer permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Something like this should work:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">which</span> dvc </span>/usr/local/bin/dvc <span class="token line"><span class="token input">$ </span><span class="token command">ls</span> <span class="token parameter variable">-la</span> /usr/local/bin/dvc </span>/usr/local/bin/dvc -> /usr/local/lib/dvc/dvc <span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">rm</span> <span class="token parameter variable">-f</span> /usr/local/bin/dvc </span><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> <span class="token function">rm</span> <span class="token parameter variable">-rf</span> 
/usr/local/lib/dvc </span><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> pkgutil <span class="token parameter variable">--forget</span> com.iterative.dvc</span></code></pre></div> <h3 id="q-how-do-i-pull-from-a-public-s3-bucket-that-contains-dvc-remote" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/575236576309674024" target="_blank" rel="nofollow noopener noreferrer">How do I pull from a public S3 bucket (that contains a DVC remote)?</a><a href="#q-how-do-i-pull-from-a-public-s3-bucket-that-contains-dvc-remote" aria-label="q how do i pull from a public s3 bucket that contains dvc remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Just add the public URL of the bucket as an HTTP remote. See <a href="https://github.com/iterative/example-get-started/blob/master/.dvc/config" target="_blank" rel="nofollow noopener noreferrer">here</a> for an example. 
<a href="https://remote.dvc.org/get-started" target="_blank" rel="nofollow noopener noreferrer">https://remote.dvc.org/get-started</a> is made to redirect to the S3 bucket anyone can read from.</p> <h3 id="q-im-getting-the-same-error-over-and-over-about-locking-error-failed-to-lock-before-running-a-command--cannot-perform-the-cmd-since-dvc-is-busy-and-locked-please-retry-the-command-later" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101" target="_blank" rel="nofollow noopener noreferrer">I’m getting the same error over and over about locking:</a> <code>ERROR: failed to lock before running a command — cannot perform the cmd since DVC is busy and locked. Please retry the command later.</code><a href="#q-im-getting-the-same-error-over-and-over-about-locking-error-failed-to-lock-before-running-a-command--cannot-perform-the-cmd-since-dvc-is-busy-and-locked-please-retry-the-command-later" aria-label="q im getting the same error over and over about locking error failed to lock before running a command cannot perform the cmd since dvc is busy and locked please retry the command later permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Most likely it happens due to an attempt to run DVC on NFS that has some configuration problems. There is a <a href="https://github.com/iterative/dvc/issues/1918" target="_blank" rel="nofollow noopener noreferrer">well known problem with DVC on NFS</a> — sometimes it hangs on trying to lock a file. 
The usual workaround for this problem is to place the DVC cache on NFS but run the project (git clone, DVC metafiles, etc.) on the local file system. Read <a href="https://discuss.dvc.org/t/share-nas-data-in-server/180/4?u=shcheklein" target="_blank" rel="nofollow noopener noreferrer">this answer</a> to see how it can be set up.</p> <hr> <p>If you have any questions, concerns or ideas, let us know in the comments below or connect with the DVC team <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a>. Our <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">DMs on Twitter</a> are open, too.</p>https://dvc.org/blog/may-19-dvc-heartbeathttps://dvc.org/blog/may-19-dvc-heartbeatTue, 21 May 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>This section of DVC Heartbeat is growing with every new issue, and that is already quite a good piece of news!</p> <p>One of the most exciting things we want to share this month is the acceptance of DVC into the <a href="https://developers.google.com/season-of-docs/" target="_blank" rel="nofollow noopener noreferrer">Google Season of Docs</a>. It is a new and unique program sponsored by Google that pairs technical writers with open source projects to collaborate and improve the open source project documentation. 
You can find the outline of the DVC vision and project ideas in <a href="https://blog.dataversioncontrol.com/dvc-project-ideas-for-google-summer-of-docs-2019-defe3a73b248" target="_blank" rel="nofollow noopener noreferrer">this dedicated blogpost</a> and check the <a href="https://developers.google.com/season-of-docs/docs/participants/" target="_blank" rel="nofollow noopener noreferrer">full list of participating open source organizations</a>. Technically, the <a href="https://developers.google.com/season-of-docs/docs/timeline" target="_blank" rel="nofollow noopener noreferrer">program is starting in a few months</a>, but there is already a fantastic increase in the number of commits and contributors, and we absolutely love it!</p> <p>The other important milestone for us was the first offline meeting with our distributed remote team. Working side by side and having non-Zoom meetings with the team was amazing. Joining forces to prepare for the upcoming conferences turned out to be the most valuable, educational and uniting experience for the whole team.</p> <p>It’s a shame that our tech lead was unable to join us due to another visa denial. 
We do hope he will finally make it to the USA for the next big conference.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/060f8f204b833689b1569a4162d67e3d/39600/the-world-is-changing.png" alt="the world is changing" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>While we were busy finalizing all the PyCon 2019 prep, our own <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">Dmitry Petrov</a> flew to New York to speak at the <a href="https://conferences.oreilly.com/artificial-intelligence/ai-ny" target="_blank" rel="nofollow noopener noreferrer">O’Reilly AI Conference</a> about the <a href="https://www.oreilly.com/library/view/artificial-intelligence-conference/9781492050544/video324691.html" target="_blank" rel="nofollow noopener noreferrer">Open Source tools for Machine Learning Models and Datasets versioning</a>. Unfortunately the video is available for the registered users only (with a free trial option) but you can have a look at Dmitry’s slides <a href="https://www.slideshare.net/DmitryPetrov15/dvc-oreilly-artificial-intelligence-conference-2019-new-york" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 404px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/bee9b4ed9981db1bf7eb9db8450fc8d1/39600/iterative-ai-twitter.png" alt="iterative ai twitter" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>We renamed our Twitter! 
Our old handle was a bit misleading and we moved from @Iterativeai to <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">@DVCorg</a> (yet keep the old one for future projects).</p> <p>Our team is so happy every time we discover an article featuring DVC or addressing one of the burning ML issues we are trying to solve. Here are some of our favorite links from the past month:</p> <ul> <li><strong><a href="https://www.pythonpodcast.com/data-version-control-episode-206/" target="_blank" rel="nofollow noopener noreferrer">Version Control For Your Machine Learning Projects — Episode 206</a></strong> by <strong><a href="https://www.linkedin.com/in/tmacey/" target="_blank" rel="nofollow noopener noreferrer">Tobias Macey</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://www.pythonpodcast.com/data-version-control-episode-206/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Version Control For Machine Learning Projects</h4> <div class="elp-description">An interview with the creator of DVC about how it improves collaboration and reduces duplicate effort on data science…</div> <div class="elp-link">pythonpodcast.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-05-21/version-control-for-your-machine-learning-projects-d29d6b83905b901e7573d865b78db914.png" alt="Version Control For Machine Learning Projects"> </div> </a> </section> <p></p> <blockquote> <p>Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, and the output of the models themselves. To address that need Dmitry Petrov built the Data Version Control project known as DVC. 
In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and simplifies concerns around reproducing and rebuilding models at different stages of the projects lifecycle.</p> </blockquote> <ul> <li><strong>Here is an <a href="https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee" target="_blank" rel="nofollow noopener noreferrer">article</a> by <a href="https://medium.com/@faviovazquez" target="_blank" rel="nofollow noopener noreferrer">Favio Vázquez</a> with a transcript of this podcast episode.</strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data version control with DVC. What do the authors have to say?</h4> <div class="elp-description">Data versioning is one of the most ignored features in data science projects, but that has to change. Here I’ll discuss…</div> <div class="elp-link">towardsdatascience.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-05-21/data-version-control-with-dvc-e9c8eefcd560f601394a53c3c300bfe5.png" alt="Data version control with DVC. 
What do the authors have to say?"> </div> </a> </section> <p></p> <ul> <li><strong><a href="https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8" target="_blank" rel="nofollow noopener noreferrer">Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis</h4> <div class="elp-description">Some claim the machine learning field is in a crisis due to software tooling that’s insufficient to ensure repeatable…</div> <div class="elp-link">towardsdatascience.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-05-21/why-git-and-git-lfs-is-not-enough-eaf6ce46d5fc3cf9d0672d03331d00b1.jpeg" alt="Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis"> </div> </a> </section> <p></p> <blockquote> <p>With Git-LFS your team has better control over the data, because it is now version controlled. Does that mean the problem is solved? Earlier we said the “<em>key issue is the training data</em>”, but that was a lie. Sort of. Yes keeping the data under version control is a big improvement. But is the lack of version control of the data files the entire problem? 
No.</p> </blockquote> <hr> <h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.</p> <p>We are sifting through the issues and discussions and share with you the most interesting takeaways.</p> <h3 id="q-this-might-be-a-favourite-gem-of-ours---our-engineers-are-so-fast-that-someone-assumed-they-were-bots" style="position:relative;">Q: This might be <a href="https://discordapp.com/channels/485586884165107732/485598848111083531/572960640122224640" target="_blank" rel="nofollow noopener noreferrer">a favourite gem of ours </a> — our engineers are so fast that someone assumed they were bots.<a href="#q-this-might-be-a-favourite-gem-of-ours---our-engineers-are-so-fast-that-someone-assumed-they-were-bots" aria-label="q this might be a favourite gem of ours our engineers are so fast that someone assumed they were bots permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We feared that too until we met them 
in person. They appeared to be real (unless bots also love Ramen now)!</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/4926411413e184b4531924e6c0aeaf02/39600/bots-also-love-ramen-now.png" alt="bots also love ramen now" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <h3 id="q-is-this-the-best-way-to-track-data-with-dvc-when-code-and-data-are-separate-having-being-burned-by-this-a-couple-of-times-ie-accidentally-pushing-large-files-to-github-i-now-keep-my-code-and-data-separate" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/572974117351849997" target="_blank" rel="nofollow noopener noreferrer">Is this the best way to track data with DVC when code and data are separate?</a> Having being burned by this a couple of times, i.e accidentally pushing large files to GitHub, I now keep my code and data separate.<a href="#q-is-this-the-best-way-to-track-data-with-dvc-when-code-and-data-are-separate-having-being-burned-by-this-a-couple-of-times-ie-accidentally-pushing-large-files-to-github-i-now-keep-my-code-and-data-separate" aria-label="q is this the best way to track data with dvc when code and data are separate having being burned by this a couple of times ie accidentally pushing large files to github i now keep my code and data separate permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Every 
time you run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> to start tracking some data artifact, its path is automatically added to the <code>.gitignore</code> file. As a result, it is hard to commit it to Git by mistake: you would need to explicitly modify the <code>.gitignore</code> first. The feature to track some external data is called <a href="https://dvc.org/doc/user-guide/managing-external-data" target="_blank" rel="nofollow noopener noreferrer">external outputs</a> (if all you need is to track some data artifacts). Usually it is used when you have some data on S3 or SSH and don’t want to pull it into your working space, but it also works when your data is located on the same machine outside of the repository.</p> <h3 id="q-how-do-i-wrap-a-step-that-downloads-a-filedirectory-into-a-dvc-stage-i-want-to-ensure-that-it-runs-only-if-file-has-no-been-downloaded-yet" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/571342592508428289" target="_blank" rel="nofollow noopener noreferrer">How do I wrap a step that downloads a file/directory into a DVC stage?</a> I want to ensure that it runs only if the file has not been downloaded yet<a href="#q-how-do-i-wrap-a-step-that-downloads-a-filedirectory-into-a-dvc-stage-i-want-to-ensure-that-it-runs-only-if-file-has-no-been-downloaded-yet" aria-label="q how do i wrap a step that downloads a filedirectory into a dvc stage i want to ensure that it runs only if file has no been downloaded yet permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z"></path></svg> </a></h3> <p>Use <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a> to track and download the remote data the first time, and again on a later <code>dvc repro</code> if the data has changed remotely. If you don’t want to track remote changes (that is, you want to lock the data once it has been downloaded), use <code>dvc run</code> with a dummy dependency (any text file that you do not touch will do) that runs an actual wget/curl to get the data.</p> <h3 id="q-how-do-i-show-a-pipeline-that-does-not-have-a-default-dvcfile-eg-i-assigned-all-files-names-manually-with--f-in-the-dvc-run-command-and-i-just-dont-have-dvcfile-anymore" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/570943786151313408" target="_blank" rel="nofollow noopener noreferrer">How do I show a pipeline that does not have a default Dvcfile?</a> (e.g. I assigned all file names manually with <code>-f</code> in the <code>dvc run</code> command and I just don’t have <code>Dvcfile</code> anymore)<a href="#q-how-do-i-show-a-pipeline-that-does-not-have-a-default-dvcfile-eg-i-assigned-all-files-names-manually-with--f-in-the-dvc-run-command-and-i-just-dont-have-dvcfile-anymore" aria-label="q how do i show a pipeline that does not have a default dvcfile eg i assigned all files names manually with f in the dvc run command and i just dont have dvcfile anymore permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Almost any command in DVC that deals with pipelines (set of DVC-files) accepts a single stage as a target, for example:</p> <div 
class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">dvc</span> pipeline show --ascii model.dvc</span></code></pre></div> <h3 id="q-dvc-hangs-or-im-getting-database-is-locked-issue" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/570843482218823682" target="_blank" rel="nofollow noopener noreferrer">DVC hangs or I’m getting <code>database is locked</code> issue</a><a href="#q-dvc-hangs-or-im-getting-database-is-locked-issue" aria-label="q dvc hangs or im getting database is locked issue permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>It’s a well-known problem with NFS and CIFS (Azure): they do not properly support file locks, which the SQLite engine requires to operate. The easiest workaround is not to create a DVC project on a network-attached partition. 
In certain cases a fix can be made by changing mounting options, check <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/570276668694855690" target="_blank" rel="nofollow noopener noreferrer">this discussion</a> for the Azure ML Service.</p> <h3 id="q-how-do-i-use-dvc-if-i-use-a-separate-drive-to-store-the-data-and-a-smallfast-ssd-to-run-computations-i-dont-have-enough-space-to-bring-data-to-my-working-space" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/570091809594671126" target="_blank" rel="nofollow noopener noreferrer">How do I use DVC if I use a separate drive to store the data and a small/fast SSD to run computations?</a> I don’t have enough space to bring data to my working space.<a href="#q-how-do-i-use-dvc-if-i-use-a-separate-drive-to-store-the-data-and-a-smallfast-ssd-to-run-computations-i-dont-have-enough-space-to-bring-data-to-my-working-space" aria-label="q how do i use dvc if i use a separate drive to store the data and a smallfast ssd to run computations i dont have enough space to bring data to my working space permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>An excellent question! 
The short answer is:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># To move your data cache to a big partition</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc cache dir</span> <span class="token parameter variable">--local</span> /path/to/an/external/partition </span> <span class="token comment"># To enable symlinks/hardlinks to avoid actual copying</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc config</span> cache.type reflink,hardlink,symlink,copy </span> <span class="token comment"># To protect the cache</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc config</span> cache.protected <span class="token boolean">true</span></span></code></pre></div> <p>The last one is highly recommended: it makes links in your working space read-only to avoid corrupting the cache. Read more about different link types <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <p>To add your data to the DVC cache for the first time, clone the repository on a big partition and run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> to add your data. 
Then you can do <code>git pull</code>, <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> on a small partition and DVC will create all the necessary links.</p> <h3 id="q-why-im-getting-paths-for-outs-overlap-error-when-i-run-dvc-add-or-dvc-run" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/571335064374345749" target="_blank" rel="nofollow noopener noreferrer">Why I’m getting <code>Paths for outs overlap</code> error when I run <code>dvc add</code> or <code>dvc run</code>?</a><a href="#q-why-im-getting-paths-for-outs-overlap-error-when-i-run-dvc-add-or-dvc-run" aria-label="q why im getting paths for outs overlap error when i run dvc add or dvc run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Usually it means that a parent directory of one of the arguments for <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> / <code>dvc run</code> is already tracked. For example, you’ve added the whole datasets directory already. And now you are trying to add a subdirectory, which is already tracked as a part of the datasets one. No need to do that. 
Instead, run <a href="https://dvc.org/doc/command-reference/add"><code>dvc add datasets</code></a> or <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro datasets.dvc</code></a> to save changes.</p> <h3 id="q-im-getting-ascii-codec-cant-encode-character-error-on-dvc-commands-when-i-deal-with-unicode-file-names" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/567310354766495747" target="_blank" rel="nofollow noopener noreferrer">I’m getting <code>ascii codec can’t encode character</code> error on DVC commands when I deal with unicode file names</a><a href="#q-im-getting-ascii-codec-cant-encode-character-error-on-dvc-commands-when-i-deal-with-unicode-file-names" aria-label="q im getting ascii codec cant encode character error on dvc commands when i deal with unicode file names permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><a href="https://perlgeek.de/en/article/set-up-a-clean-utf8-environment" target="_blank" rel="nofollow noopener noreferrer">Check the locale settings you have</a> (the <code>locale</code> command on Linux). Python expects a locale that can handle Unicode printing. This is usually solved with these commands: <code>export LC_ALL=en_US.UTF-8</code> and <code>export LANG=en_US.UTF-8</code>. 
You can place those exports into <code>.bashrc</code> or other file that defines your environment.</p> <h3 id="q-does-dvc-use-the-same-logins-aws-cli-has-when-using-an-s3-bucket-as-its-reporemote-storage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/563149775340568576" target="_blank" rel="nofollow noopener noreferrer">Does DVC use the same logins <code>aws-cli</code> has when using an S3 bucket as its repo/remote storage</a>?<a href="#q-does-dvc-use-the-same-logins-aws-cli-has-when-using-an-s3-bucket-as-its-reporemote-storage" aria-label="q does dvc use the same logins aws cli has when using an s3 bucket as its reporemote storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>In short — yes, but it can be also configured. DVC is going to use either your default profile (from <code>~/.aws/*</code>) or your env vars by default. If you need more flexibility (e.g. 
you need to use different credentials for different projects, etc.), check out <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html" target="_blank" rel="nofollow noopener noreferrer">this guide</a> to configure custom AWS profiles, which you can then use with DVC through these <a href="https://dvc.org/doc/commands-reference/remote/add#options" target="_blank" rel="nofollow noopener noreferrer">remote options</a>.</p> <h3 id="q-how-can-i-output-multiple-metrics-from-a-single-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/566000729505136661" target="_blank" rel="nofollow noopener noreferrer">How can I output multiple metrics from a single file?</a><a href="#q-how-can-i-output-multiple-metrics-from-a-single-file" aria-label="q how can i output multiple metrics from a single file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Let’s say I have the following in a file:</p> <div class="gatsby-highlight" data-language="json"><pre class="language-json"><code class="language-json"><span class="token punctuation">{</span> "AUC_RATIO"<span class="token operator">:</span> <span class="token punctuation">{</span> "train"<span class="token operator">:</span> <span class="token number">0.8922748258797667</span><span class="token punctuation">,</span> "valid"<span class="token operator">:</span> <span class="token number">0.8561602726251776</span><span class="token punctuation">,</span> "xval"<span class="token operator">:</span> <span class="token 
number">0.8843431199314923</span> <span class="token punctuation">}</span> <span class="token punctuation">}</span></code></pre></div> <p>How can I show both <code>train</code> and <code>valid</code> without <code>xval</code>?</p> <p>You can use <a href="https://dvc.org/doc/command-reference/metrics/show"><code>dvc metrics show</code></a> command <code>--xpath</code> option and provide multiple attribute names to it:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc metrics show</span> metrics.json <span class="token punctuation">\</span> <span class="token parameter variable">--type</span> json <span class="token punctuation">\</span> <span class="token parameter variable">--xpath</span> AUC_RATIO<span class="token punctuation">[</span>train,valid<span class="token punctuation">]</span> </span> metrics.json: 0.89227482588 0.856160272625</code></pre></div> <h3 id="q-what-is-the-quickest-way-to-add-a-new-dependency-to-a-dvc-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/566314479499870211" target="_blank" rel="nofollow noopener noreferrer">What is the quickest way to add a new dependency to a DVC-file?</a><a href="#q-what-is-the-quickest-way-to-add-a-new-dependency-to-a-dvc-file" aria-label="q what is the quickest way to add a new dependency to a dvc file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are a few options to add a new 
dependency:</p> <ul> <li> <p>Open the DVC-file in your favorite editor and add the dependency there without an md5. DVC will detect that the stage has changed, re-run it, and recalculate the md5 checksums during the next <code>dvc repro</code>;</p> </li> <li> <p>Use <code>dvc run --no-exec</code>. It will rewrite the existing file for you with the new parameters.</p> </li> </ul> <h3 id="q-is-there-a-way-to-add-a-dependency-to-a-python-package-so-it-runs-a-stage-again-if-it-imported-the-updated-library" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/566315265646788628" target="_blank" rel="nofollow noopener noreferrer">Is there a way to add a dependency to a Python package, so it runs a stage again if it imported the updated library?</a><a href="#q-is-there-a-way-to-add-a-dependency-to-a-python-package-so-it-runs-a-stage-again-if-it-imported-the-updated-library" aria-label="q is there a way to add a dependency to a python package so it runs a stage again if it imported the updated library permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>The only recommended way so far would be to somehow make DVC aware of your package’s version. 
One way to do that would be to create a separate stage that writes the version of that specific package to a file that your stage depends on:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-o</span> mypkgver 'pip show mypkg <span class="token operator">></span> mypkgver' </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-d</span> mypkgver <span class="token parameter variable">-d</span> <span class="token punctuation">..</span>. <span class="token parameter variable">-o</span> <span class="token punctuation">..</span> mycmd</span></code></pre></div> <h3 id="q-is-there-anyway-to-forcibly-recompute-the-hashes-of-dependencies-in-a-pipeline-dvc-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/564807276146458624" target="_blank" rel="nofollow noopener noreferrer">Is there any way to forcibly recompute the hashes of dependencies in a pipeline DVC-file?</a><a href="#q-is-there-anyway-to-forcibly-recompute-the-hashes-of-dependencies-in-a-pipeline-dvc-file" aria-label="q is there anyway to forcibly recompute the hashes of dependencies in a pipeline dvc file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>E.g. 
I made some whitespace/comment changes in my code and I want to tell DVC “it’s ok, you don’t have to recompute everything”.</p> <p>Yes, you could use <a href="https://dvc.org/doc/command-reference/commit#-f"><code>dvc commit -f</code></a>. It will save all current checksums without re-running your commands.</p> <h3 id="q-i-have-projects-that-use-data-thats-stored-in-s3-i-never-have-data-locally-to-use-dvc-push-but-i-would-like-to-have-this-data-version-controlled-is-there-a-way-to-use-the-features-of-dvc-in-this-use-case" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/563352000281182218" target="_blank" rel="nofollow noopener noreferrer">I have projects that use data that’s stored in S3. I never have data locally to use <code>dvc push</code>, but I would like to have this data version controlled.</a> Is there a way to use the features of DVC in this use case?<a href="#q-i-have-projects-that-use-data-thats-stored-in-s3-i-never-have-data-locally-to-use-dvc-push-but-i-would-like-to-have-this-data-version-controlled-is-there-a-way-to-use-the-features-of-dvc-in-this-use-case" aria-label="q i have projects that use data thats stored in s3 i never have data locally to use dvc push but i would like to have this data version controlled is there a way to use the features of dvc in this use case permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Yes! 
These DVC features are called <a href="https://dvc.org/doc/user-guide/large-dataset-optimization" target="_blank" rel="nofollow noopener noreferrer">external outputs</a> and <a href="https://dvc.org/doc/user-guide/external-dependencies" target="_blank" rel="nofollow noopener noreferrer">external dependencies</a>. You can use either or both to track, process, and version your data in cloud storage without downloading it locally.</p> <hr> <p>If you have any questions, concerns or ideas, let us know <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> and our stellar team will get back to you in no time!</p>https://dvc.org/blog/dvc-project-ideas-for-google-summer-of-docs-2019https://dvc.org/blog/dvc-project-ideas-for-google-summer-of-docs-2019Tue, 23 Apr 2019 00:00:00 GMT<p>We strongly believe that well-shaped documentation is key to making the product truly open. We have been investing lots of time and energy in improving our docs lately. Being a team of 90% engineers, we are eager to welcome writers into our team and our community. We are happy to share our experience, introduce them to the world of open source and machine learning best practices, guide them through the open source contribution process and work together on improving our documentation.</p> <p>DVC was started in late 2017 by a data scientist and an engineer. It is now growing pretty fast, and though our in-house team is quite small, we have to thank our contributors (more than 80 in both code and docs) for developing DVC with us. 
When working with DVC, the technical writer will not only get lots of hands-on experience in writing technical docs, but will also become part of the DVC community, a warm and welcoming gathering of ML and DS enthusiasts and an invaluable source of inspiration and expertise in ML engineering.</p> <h3 id="about-dvc" style="position:relative;">About DVC<a href="#about-dvc" aria-label="about dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC is the brainchild of a data scientist and an engineer. It was created to fill the gaps in ML process tooling and evolved into a successful open source project.</p> <p>ML brings changes in development and research processes. These ML processes require new tools for data versioning, ML pipeline versioning, resource management for model training and other tasks that haven’t been formalized. Traditional software development tools do not fully cover an ML team’s needs, but there are no good alternatives. This forces engineers to develop a custom toolset to manage data files, keep track of ML experiments and connect data and source code together. The ML process becomes very fragile and requires tons of tribal knowledge.</p> <p>We have been working on <a href="http://DVC.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> by adopting best ML practices and turning them into a Git-like command line tool. DVC versions multi-gigabyte datasets and ML models, making them shareable and reproducible. The tool helps to organize a more rigorous process around datasets and the data derivatives. 
Your favorite cloud storage (S3, GCS, or a bare metal SSH server) can be used with DVC as a data file backend.</p> <p>If you are interested in learning a little bit more about DVC and its journey, here is a great interview with the DVC creator in Episode 206 of Podcast.__init__. Listen to it <a href="https://www.pythonpodcast.com/data-version-control-episode-206/" target="_blank" rel="nofollow noopener noreferrer">HERE</a> or read the transcript <a href="https://towardsdatascience.com/data-version-control-with-dvc-what-do-the-authors-have-to-say-3c3b10f27ee" target="_blank" rel="nofollow noopener noreferrer">HERE.</a></p> <h3 id="the-state-of-dvc-documentation" style="position:relative;">The state of DVC documentation<a href="#the-state-of-dvc-documentation" aria-label="the state of dvc documentation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC is a pretty young project, developed and maintained solely by engineers. Like many open source projects, we started from the bottom, and for a long time our <a href="https://dvc.org/doc" target="_blank" rel="nofollow noopener noreferrer">documentation</a> was a bunch of bits and pieces. Nowadays improving documentation is one of our top priorities. We moved to a new documentation engine built in-house and started working with several technical writers. Certain parts have been tremendously improved recently, e.g. 
<a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get Started</a> and <a href="https://dvc.org/doc/commands-reference/fetch" target="_blank" rel="nofollow noopener noreferrer">certain parts of Commands Reference</a> . So far most of our documentation has been written majorly by the engineering team and there is need for improving the overall structure and making some parts more friendly from a new user perspective. We have mostly complete <a href="https://dvc.org/doc/commands-reference" target="_blank" rel="nofollow noopener noreferrer">reference documentation</a> for each command, although some functions are missing good actionable examples. We also have a <a href="https://dvc.org/doc/user-guide" target="_blank" rel="nofollow noopener noreferrer">User Guide</a>, however it is not in very good shape. We strive for making our documentation clear and comprehensive for users of various backgrounds and proficiency levels and this is where we do need some fresh perspective.</p> <h3 id="how-dvc-documentation-is-built" style="position:relative;">How DVC documentation is built<a href="#how-dvc-documentation-is-built" aria-label="how dvc documentation is built permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We have an open Github Apache-2 licensed repository for the <a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC website</a>, the documentation engine and the <a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener 
noreferrer">documentation files</a>. The website is built with Node.js + React, including the documentation engine (built in-house).</p> <p>Each documentation page is a static Markdown file in the repository, e.g. <a href="https://github.com/iterative/dvc.org/blob/main/content/docs/command-reference/index.md" target="_blank" rel="nofollow noopener noreferrer">example here</a>. It is rendered dynamically in the browser, no preprocessing is required. It means that tech writers or contributors need to write/edit a Markdown file, create a pull request and merge it into the master branch of the <a href="https://github.com/iterative/dvc.org" target="_blank" rel="nofollow noopener noreferrer">repository.</a> The complete <a href="https://github.com/iterative/dvc.org/blob/main/README.md#contributing" target="_blank" rel="nofollow noopener noreferrer">documentation contributing guide</a> describes the directory structure and locations for the different documentation parts.</p> <h3 id="dvcs-approach-to-documentation-work" style="position:relative;">DVC’s approach to documentation work<a href="#dvcs-approach-to-documentation-work" aria-label="dvcs approach to documentation work permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Documentation tasks and issues are maintained on our doc’s GitHub <a href="https://github.com/iterative/dvc.org/issues" target="_blank" rel="nofollow noopener noreferrer">issue tracker</a>. 
Changes to the documentation are made via pull requests on GitHub and go through our standard review process, which is the same for documentation and code. A technical writer would be trained in working with our current development process. It generally means that tech writers or contributors need to write/edit a Markdown file, then use Git and GitHub to create a pull request and publish it. The documentation <a href="https://github.com/iterative/dvc.org/blob/main/README.md#contributing" target="_blank" rel="nofollow noopener noreferrer">contributing guide</a> includes style conventions and other details. Documentation is considered as important as code. The engineering team has a policy of writing or updating the relevant sections whenever something new is released. If it’s something too involved, engineers may create a ticket and ask for help. There is one maintainer who is responsible for doing final reviews and merging the changes. In this sense, our documentation process is very similar to that of any other open source project.</p> <h2 id="project-ideas-for-gsod19" style="position:relative;">Project ideas for GSoD’19<a href="#project-ideas-for-gsod19" aria-label="project ideas for gsod19 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We identified a number of ideas to work on, and they fall into two major topics. 
Both topics are pretty broad, and we don’t expect to cover them completely during this GSoD, but we hope to make real progress.</p> <p>First of all, we want to bring more structure and logic to our documentation to improve the user onboarding experience. The goal is for a new user to have a clear path they can follow and to understand what takeaways each part of the documentation provides. In particular, this means improving how <a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get Started</a>, <a href="https://dvc.org/doc/tutorial" target="_blank" rel="nofollow noopener noreferrer">Tutorials</a> and <a href="https://dvc.org/doc/tutorials/versioning" target="_blank" rel="nofollow noopener noreferrer">Examples</a> relate to each other, restructuring the existing <a href="https://dvc.org/doc/user-guide" target="_blank" rel="nofollow noopener noreferrer">User Guide</a> to explain basic concepts, and writing more use cases that resonate with ML engineers and data scientists.</p> <p>The other issue we would like to tackle is improving and expanding the existing reference docs: command descriptions, examples, etc. It involves filling in the gaps and developing new sections, similar to <a href="https://dvc.org/doc/commands-reference/fetch" target="_blank" rel="nofollow noopener noreferrer">this one</a>. 
We would also love to see more illustrative materials.</p> <h3 id="project-1-improving-and-expanding-user-guide" style="position:relative;">Project 1: Improving and expanding User Guide<a href="#project-1-improving-and-expanding-user-guide" aria-label="project 1 improving and expanding user guide permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><strong>Description and details:</strong> Reviewing, restructuring and filling major gaps in the User Guide (introductory parts of the basic concepts of DVC), e.g. have a look at <a href="https://github.com/iterative/dvc.org/issues/144" target="_blank" rel="nofollow noopener noreferrer">this ticket</a> or <a href="https://github.com/iterative/dvc.org/issues/53" target="_blank" rel="nofollow noopener noreferrer">this one</a>.</p> <p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and <a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p> <h3 id="project-2-expanding-and-developing-new-tutorials-and-use-cases" style="position:relative;">Project 2: Expanding and developing new tutorials and use cases.<a href="#project-2-expanding-and-developing-new-tutorials-and-use-cases" aria-label="project 2 expanding and developing new tutorials and use cases permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 
4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><strong>Description and details:</strong> We already have some requests for more tutorials, e.g. <a href="https://github.com/iterative/dvc.org/issues/96" target="_blank" rel="nofollow noopener noreferrer">this ticket</a>. Here is another good <a href="https://github.com/iterative/dvc.org/issues/194" target="_blank" rel="nofollow noopener noreferrer">use case request</a> . If you are going to work on this project you would need some domain knowledge, preferably some basic ML or data science experience.</p> <p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and <a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p> <h3 id="project-3-improving-new-user-onboarding" style="position:relative;">Project 3: Improving new user onboarding<a href="#project-3-improving-new-user-onboarding" aria-label="project 3 improving new user onboarding permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><strong>Description and details:</strong> Analyze and restructure user walkthrough across <a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">Get started</a>, <a href="https://dvc.org/doc/tutorial" target="_blank" rel="nofollow noopener noreferrer">Tutorials</a> and <a 
href="https://dvc.org/doc/tutorials/versioning" target="_blank" rel="nofollow noopener noreferrer">Examples</a>. These three have one thing in common: hands-on experience with DVC. If you choose this project, we will work together to come up with a better location for the Examples (to move them out of the Get Started shadow), and a better location for the Tutorials (to reference external tutorials that were developed by our community members and published on different platforms).</p> <p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and <a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p> <h3 id="project-4-improving-commands-reference" style="position:relative;">Project 4: Improving commands reference<a href="#project-4-improving-commands-reference" aria-label="project 4 improving commands reference permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><strong>Description and details:</strong> We will work on improving our <a href="https://dvc.org/doc/commands-reference" target="_blank" rel="nofollow noopener noreferrer">Commands reference</a> section. This includes expanding it and filling in the gaps. One of the biggest pain points right now is the Examples. Users want them to be <a href="https://github.com/iterative/dvc.org/issues/198" target="_blank" rel="nofollow noopener noreferrer">easy to run and try</a>, and there is a lot to be done in terms of improvement. 
We have a good example of how it should be done <a href="https://dvc.org/doc/commands-reference/fetch" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and <a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p> <h3 id="project-5-describe-and-integrate-dvc-packages" style="position:relative;">Project 5: Describe and integrate “DVC packages”<a href="#project-5-describe-and-integrate-dvc-packages" aria-label="project 5 describe and integrate dvc packages permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><strong>Description and details:</strong> Describe the brand new feature “DVC packages” and integrate it with the rest of the documentation. We have been working hard to release a few new commands to help with dataset management (have a look at <a href="https://github.com/iterative/dvc/issues/1487" target="_blank" rel="nofollow noopener noreferrer">this ticket</a>). It’s a major feature that deserves its place in the Get Started, Use cases, Commands Reference, etc.</p> <p><strong>Mentors</strong>: <a href="https://github.com/shcheklein" target="_blank" rel="nofollow noopener noreferrer">@shcheklein</a> and <a href="https://github.com/dmpetrov" target="_blank" rel="nofollow noopener noreferrer">@dmpetrov</a></p> <p>The ideas we outlined above are just examples of what we can work on. 
We are open for any other suggestions and would like to work together with the technical writer to make the contribution experience both useful and enjoyable for all parties involved. If you have any suggestions or questions we would love to hear from you => DVC.org/support and our DMs on <a href="https://twitter.com/DVCorg" target="_blank" rel="nofollow noopener noreferrer">Twitter</a> are always open!</p> <hr> <p>Special thanks to the <a href="https://numfocus.org/" target="_blank" rel="nofollow noopener noreferrer">NumFOCUS</a> for the ideas list inspiration.</p> <p>If you are a tech writer — check the <a href="https://developers.google.com/season-of-docs/docs/tech-writer-guide" target="_blank" rel="nofollow noopener noreferrer">Technical writer guide</a>. From April 30, 2019 you can see the list of participating open source organizations on the <a href="https://g.co/seasonofdocs" target="_blank" rel="nofollow noopener noreferrer">Season of Docs website</a>. The application period for technical writers opens on <strong>May 29, 2019</strong> and ends on June 28, 2019.</p>https://dvc.org/blog/april-19-dvc-heartbeathttps://dvc.org/blog/april-19-dvc-heartbeatThu, 18 Apr 2019 00:00:00 GMT<h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We have some exciting news to share this month!</p> <p>DVC is going to <a href="https://us.pycon.org/2019/" target="_blank" rel="nofollow noopener noreferrer">PyCon 2019</a>! 
It is the first conference that we attend as a team. When we say ‘team’ — we mean it. Our engineers are flying from all over the globe to get together offline and catch up with fellow Pythonistas.</p> <p>The <a href="https://us.pycon.org/2019/schedule/talks/list/" target="_blank" rel="nofollow noopener noreferrer">speaker pipeline</a> is amazing! DVC creator Dmitry Petrov is giving a talk on <a href="https://us.pycon.org/2019/schedule/presentation/176/" target="_blank" rel="nofollow noopener noreferrer">Machine learning model and dataset versioning practices</a>.</p> <p>Stop by our booth at the Startup Row on Saturday, May 4, reach out and let us know that you are willing to chat, or simply find a person with a huge DVC owl on their shirt!</p> <p>Speaking of the owls — DVC has done some rebranding recently and we love our new logo. Special thanks to <a href="https://99designs.com/" target="_blank" rel="nofollow noopener noreferrer">99designs.com</a> for building a great platform for finding trusted designers.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 700px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/91d26fd1613290e118c7a4ad1fc5a088/39600/trusted-designers.png" alt="trusted designers" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>DVC is moving fast (almost as fast as my two-year-old). 
We do our best to keep up and totally love all the buzz in our community channels lately!</p> <p>Here is a number of interesting reads that caught our eye:</p> <ul> <li><strong><a href="https://blog.codecentric.de/en/2019/03/walkthrough-dvc/" target="_blank" rel="nofollow noopener noreferrer">A walkthrough of DVC</a> by <a href="https://www.linkedin.com/in/bert-besser-284564182/" target="_blank" rel="nofollow noopener noreferrer">Bert Besser</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://blog.codecentric.de/en/2019/03/walkthrough-dvc/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">A walkthrough of DVC — codecentric AG Blog</h4> <div class="elp-description">This post is on how to systematially organize Machine Learning (ML) model development. A model’s performance improves…</div> <div class="elp-link">blog.codecentric.de</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-04-18/walkthrough-of-dvc-1c1b72dfeddae88a4249d5fefe8d3cc6.png" alt="A walkthrough of DVC — codecentric AG Blog"> </div> </a> </section> <p></p> <p>A great article about using DVC with a quite advanced scenario and docker. 
If you haven’t had a chance to try <a href="http://dvc.org/" target="_blank" rel="nofollow noopener noreferrer">DVC.org</a> yet — this is a great comprehensive read on why you should do so right away.</p> <ul> <li><strong><a href="https://github.com/EthicalML/state-of-mlops-2019" target="_blank" rel="nofollow noopener noreferrer">The state of machine learning operations</a> by <a href="https://www.linkedin.com/in/axsaucedo/" target="_blank" rel="nofollow noopener noreferrer">Alejandro Saucedo</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://github.com/EthicalML/state-of-mlops-2019" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">The state of machine learning operations</h4> <div class="elp-description">Contribute to EthicalML/state-of-mlops-2019 development by creating an account on GitHub.</div> <div class="elp-link">github.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-04-18/the-state-of-machine-learning-operations-c6493fc09702d356e3cc7ced2711e3e3.jpeg" alt="The state of machine learning operations"> </div> </a> </section> <p></p> <p>A short (only 8 minutes!) and inspiring talk by Alejandro Saucedo at FOSDEM. Alejandro covers the key trends in machine learning operations, as well as most recent open source tools and frameworks. Focused on reproducibility, monitoring and explainability, this lightning talk is a great snapshot of the current state of ML operations.</p> <ul> <li><strong><a href="https://hackernoon.com/interview-with-kaggle-grandmaster-senior-cv-engineer-at-lyft-dr-vladimir-i-iglovikov-9938e1fc7c" target="_blank" rel="nofollow noopener noreferrer">Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. 
Iglovikov</a> by <a href="https://twitter.com/bhutanisanyam1" target="_blank" rel="nofollow noopener noreferrer">Sanyam Bhutani</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://hackernoon.com/interview-with-kaggle-grandmaster-senior-cv-engineer-at-lyft-dr-vladimir-i-iglovikov-9938e1fc7c" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. Iglovikov</h4> <div class="elp-description">Part 24 of The series where I interview my heroes.</div> <div class="elp-link">hackernoon.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-04-18/interview-with-kaggle-grandmaster-d1bc437a22ebae88bba9e06d5f166c06.jpeg" alt="Interview with Kaggle Grandmaster, Senior Computer Vision Engineer at Lyft: Dr. Vladimir I. Iglovikov"> </div> </a> </section> <p></p> <blockquote> <p>There is no way you will become Kaggle Master and not learn how to approach anew, the unknown problem in a fast hacking way with a very high number of iterations per unit of time. This skill in the world of competitive learning is the question of survival</p> </blockquote> <hr> <h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are lots of hidden gems in our Discord community discussions. 
Sometimes they are scattered all over the channels and hard to track down.</p> <p>We are sifting through the issues and discussions and sharing the most interesting takeaways with you.</p> <h3 id="q-what-are-the-system-requirements-to-install-dvc-type-of-operating-system-dependencies-of-another-application-as-git-memory-cpu-etc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/552098155861114891" target="_blank" rel="nofollow noopener noreferrer">What are the system requirements to install DVC (type of operating system, dependencies of another application (as GIT), memory, cpu, etc).</a><a href="#q-what-are-the-system-requirements-to-install-dvc-type-of-operating-system-dependencies-of-another-application-as-git-memory-cpu-etc" aria-label="q what are the system requirements to install dvc type of operating system dependencies of another application as git memory cpu etc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <ul> <li> <p>It supports Windows, Mac, and Linux, with Python 2 and 3.</p> </li> <li> <p>There are no specific CPU or RAM requirements: it’s a lightweight command line tool and should be able to run pretty much everywhere you can run Python.</p> </li> <li> <p>It depends on a few Python libraries that it installs as dependencies (they are specified in the <a href="https://github.com/iterative/dvc/blob/master/setup.py" target="_blank" rel="nofollow noopener noreferrer"><code>setup.py</code></a>).</p> </li> <li> <p>It does not depend on Git and theoretically could be run without any SCM. 
Running it on top of a Git repository, however, is recommended and gives you the ability to save the history of datasets, models, etc. (even though it does not put them into Git directly).</p> </li> </ul> <h3 id="q-do-i-have-to-buy-a-server-license-to-run-dvc-do-you-have-this" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/560212552638791706" target="_blank" rel="nofollow noopener noreferrer">Do I have to buy a server license to run DVC, do you have this?</a><a href="#q-do-i-have-to-buy-a-server-license-to-run-dvc-do-you-have-this" aria-label="q do i have to buy a server license to run dvc do you have this permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>No server license is needed for DVC. 
It is 100% free and open source.</p> <h3 id="q-what-is-the-storage-limit-when-using-dvc" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/560154903331340289" target="_blank" rel="nofollow noopener noreferrer">What is the storage limit when using DVC?</a><a href="#q-what-is-the-storage-limit-when-using-dvc" aria-label="q what is the storage limit when using dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>I am trying to version control datasets and models larger than 10 GB (potentially even bigger). Can DVC handle this?</p> <p>DVC itself enforces no limit. It depends on the size of your local or <a href="https://dvc.org/doc/commands-reference/remote" target="_blank" rel="nofollow noopener noreferrer">remote storage</a>. 
You need enough space available on S3, your SSH server, or whichever other storage you use to keep the data files, models, and versions of them that you would like to store.</p> <h3 id="q-how-does-dvc-know-the-sequence-of-stages-to-run" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/553731815228178433" target="_blank" rel="nofollow noopener noreferrer">How does DVC know the sequence of stages to run</a>?<a href="#q-how-does-dvc-know-the-sequence-of-stages-to-run" aria-label="q how does dvc know the sequence of stages to run permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>How does it connect them? Does it see that there is a dependency that is output by the first run?</p> <p>DVC figures out the pipeline by looking at the dependencies and outputs of the stages. 
For example, having the following:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">$ dvc run -f download.dvc \
          -o joke.txt \
          "curl https://geek-jokes.sameerkumar.website/api > joke.txt"
$ dvc run -f duplicate.dvc \
          -d joke.txt \
          -o duplicate.txt \
          "cat joke.txt joke.txt > duplicate.txt"</code></pre></div> <p>you will end up with two stages: <code>download.dvc</code> and <code>duplicate.dvc</code>. The download stage has <code>joke.txt</code> as an output. The duplicate stage declares <code>joke.txt</code> as a dependency, and it is the same file. 
DVC detects that and creates a pipeline by joining those stages.</p> <p>Stage files are human readable, so you can inspect their content; their format is described <a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="q-is-it-possible-to-use-the-same-data-of-a-remote-in-two-different-repositories" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/560022999848321026" target="_blank" rel="nofollow noopener noreferrer">Is it possible to use the same data of a remote in two different repositories?</a><a href="#q-is-it-possible-to-use-the-same-data-of-a-remote-in-two-different-repositories" aria-label="q is it possible to use the same data of a remote in two different repositories permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>(e.g. run <a href="https://dvc.org/doc/command-reference/pull#-r"><code>dvc pull -r my_remote</code></a> in one repo to pull some data, and running the same command in a different Git repo should pull the same data)</p> <p>Yes! It’s a frequent scenario for multiple repos to share remotes and even a local cache. A DVC file serves as a link to the actual data. If you add the same DVC file (e.g. <code>data.dvc</code>) to the new repo and run <a href="https://dvc.org/doc/command-reference/pull#-r"><code>dvc pull -r remotename data.dvc</code></a>, it will fetch the data. 
In every project, you first have to use <a href="https://dvc.org/doc/command-reference/remote/add"><code>dvc remote add</code></a> to specify the location of the remote storage you would like to share. Alternatively (check out the question below), you could use <code>--global</code> to specify a single default remote (and/or cache dir) per machine.</p> <h3 id="q-could-i-set-a-global-remote-server-instead-of-config-in-each-project" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/559653121228275727" target="_blank" rel="nofollow noopener noreferrer">Could I set a global remote server, instead of config in each project?</a><a href="#q-could-i-set-a-global-remote-server-instead-of-config-in-each-project" aria-label="q could i set a global remote server instead of config in each project permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Use <code>--global</code> when you specify the remote settings. The remote will then be visible to all projects on the same machine. The <code>--global</code> option saves the remote configuration to the global config (e.g. <code>~/.config/dvc/config</code>) instead of the per-project one (<code>.dvc/config</code>). 
See more details <a href="https://dvc.org/doc/commands-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <h3 id="q-how-do-i-version-a-large-dataset-in-s3-or-any-other-storage" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/554679392823934977" target="_blank" rel="nofollow noopener noreferrer">How do I version a large dataset in S3 or any other storage?</a><a href="#q-how-do-i-version-a-large-dataset-in-s3-or-any-other-storage" aria-label="q how do i version a large dataset in s3 or any other storage permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>We recommend skimming through our <a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">get started</a> tutorial. To summarize the data versioning process of DVC:</p> <ul> <li>You create stage (aka DVC) files by adding or importing files (<a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> / <a href="https://dvc.org/doc/command-reference/import"><code>dvc import</code></a>), or by running a command that generates files:</li> </ul> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">--out</span> file.csv <span class="token string">"wget https://example.com/file.csv"</span></span></code></pre></div> <ul> <li> <p>These stage files are tracked by <code>git</code></p> </li> 
<li> <p>You use git to retrieve previous stage files (e.g. <code>git checkout v1.0</code>)</p> </li> <li> <p>Then use <a href="https://dvc.org/doc/command-reference/checkout"><code>dvc checkout</code></a> to retrieve all the files referenced by those stage files</p> </li> </ul> <p>All versions of your files are stored in the <code>.dvc/cache</code> directory, which you sync with remote file storage (for example, an S3 bucket) using the <a href="https://dvc.org/doc/command-reference/push"><code>dvc push</code></a> or <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a> commands. This is analogous to <code>git push</code> / <code>git pull</code>, but instead of syncing your <code>.git</code> directory, you are syncing your <a href="https://dvc.org/doc/user-guide/project-structure/dvc-files"><code>.dvc</code></a> directory.</p> <h3 id="q-how-do-i-moverename-a-dvc-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/558216007684980736" target="_blank" rel="nofollow noopener noreferrer">How do I move/rename a DVC-file?</a><a href="#q-how-do-i-moverename-a-dvc-file" aria-label="q how do i moverename a dvc file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>If you need to move your DVC file somewhere, it is pretty easy, even if done manually:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc">$ mv my.dvc data/my.dvc
# now open data/my.dvc with your favorite editor and change wdir in it to 'wdir: ../'.</code></pre></div> <h3 id="q-i-performed-dvc-push-of-a-file-to-a-remote-on-the-remote-there-is-created-a-directory-called-8f-with-a-file-inside-called-2ec34faf91ff15ef64abf3fbffa7ee-the-original-csv-file-doesnt-appear-on-the-remote-is-that-expected-behaviour" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/555431645402890255" target="_blank" rel="nofollow noopener noreferrer">I performed <code>dvc push</code> of a file to a remote. On the remote there is created a directory called <code>8f</code> with a file inside called <code>2ec34faf91ff15ef64abf3fbffa7ee</code>. The original CSV file doesn’t appear on the remote. Is that expected behaviour?</a><a href="#q-i-performed-dvc-push-of-a-file-to-a-remote-on-the-remote-there-is-created-a-directory-called-8f-with-a-file-inside-called-2ec34faf91ff15ef64abf3fbffa7ee-the-original-csv-file-doesnt-appear-on-the-remote-is-that-expected-behaviour" aria-label="q i performed dvc push of a file to a remote on the remote there is created a directory called 8f with a file inside called 2ec34faf91ff15ef64abf3fbffa7ee the original csv file doesnt appear on the remote is that expected behaviour permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>This is expected behavior. 
DVC saves files under the name created from their checksum in order to prevent duplication. If you delete “pushed” file in your project directory and perform <a href="https://dvc.org/doc/command-reference/pull"><code>dvc pull</code></a>, DVC will take care of pulling the file and renaming it to “original” name.</p> <p>Below are some details about how DVC cache works, just to illustrate the logic. When you add a data source:</p> <p></p><div id="gist95752678" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-heartbeat-remote-file-naming-sh" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-shell" style="overflow: auto" tabindex="0" role="region" aria-label="heartbeat-remote-file-naming.sh content, created by SvetaGr on 12:49AM on April 17, 2019."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> <template class="js-file-alert-template"> <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center"> <svg aria-hidden="true" height="16" viewBox="0 0 16 16" version="1.1" width="16" data-view-component="true" class="octicon octicon-alert"> <path d="M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z"></path> </svg> <span> This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. 
<a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a> </span> <div data-view-component="true" class="flash-action"> <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn"> Show hidden characters </a> </div> </div></template> <template class="js-line-alert-template"> <span aria-label="This line has hidden Unicode characters" data-view-component="true" class="line-alert tooltipped tooltipped-e"> <svg aria-hidden="true" height="16" viewBox="0 0 16 16" version="1.1" width="16" data-view-component="true" class="octicon octicon-alert"> <path d="M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z"></path> </svg> </span></template> <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="heartbeat-remote-file-naming.sh"> <tbody><tr> <td id="file-heartbeat-remote-file-naming-sh-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-heartbeat-remote-file-naming-sh-LC1" class="blob-code blob-code-inner js-file-line">$ echo "foo" > data.txt</td> </tr> <tr> <td id="file-heartbeat-remote-file-naming-sh-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-heartbeat-remote-file-naming-sh-LC2" class="blob-code blob-code-inner js-file-line">$ dvc add data.txt</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/SvetaGr/b69fa8ce36bcce00ecd69e7f2d7ccd2e/raw/34017336326e3773f2e3a490e1f66265025f8c81/heartbeat-remote-file-naming.sh" style="float:right" class="Link--inTextBlock">view raw</a> <a 
href="https://gist.github.com/SvetaGr/b69fa8ce36bcce00ecd69e7f2d7ccd2e#file-heartbeat-remote-file-naming-sh" class="Link--inTextBlock"> heartbeat-remote-file-naming.sh </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p>It computes the (md5) checksum of the file and generates a DVC file with related information:</p> <p></p><div id="gist95752688" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-heartbeat-dvc-file-2019-04-yaml" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-yaml" style="overflow: auto" tabindex="0" role="region" aria-label="heartbeat-dvc-file-2019-04.yaml content, created by SvetaGr on 12:50AM on April 17, 2019."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> <template class="js-file-alert-template"> <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center"> <svg aria-hidden="true" height="16" viewBox="0 0 16 16" version="1.1" width="16" data-view-component="true" class="octicon octicon-alert"> <path d="M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z"></path> </svg> <span> This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. 
<a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a> </span> <div data-view-component="true" class="flash-action"> <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn"> Show hidden characters </a> </div> </div></template> <template class="js-line-alert-template"> <span aria-label="This line has hidden Unicode characters" data-view-component="true" class="line-alert tooltipped tooltipped-e"> <svg aria-hidden="true" height="16" viewBox="0 0 16 16" version="1.1" width="16" data-view-component="true" class="octicon octicon-alert"> <path d="M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z"></path> </svg> </span></template> <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="heartbeat-dvc-file-2019-04.yaml"> <tbody><tr> <td id="file-heartbeat-dvc-file-2019-04-yaml-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-heartbeat-dvc-file-2019-04-yaml-LC1" class="blob-code blob-code-inner js-file-line">md5: 3bccbf004063977442029334c3448687</td> </tr> <tr> <td id="file-heartbeat-dvc-file-2019-04-yaml-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-heartbeat-dvc-file-2019-04-yaml-LC2" class="blob-code blob-code-inner js-file-line">outs:</td> </tr> <tr> <td id="file-heartbeat-dvc-file-2019-04-yaml-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-heartbeat-dvc-file-2019-04-yaml-LC3" class="blob-code blob-code-inner js-file-line">- cache: true</td> </tr> <tr> <td id="file-heartbeat-dvc-file-2019-04-yaml-L4" class="blob-num 
js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-heartbeat-dvc-file-2019-04-yaml-LC4" class="blob-code blob-code-inner js-file-line"> md5: d3b07384d113edec49eaa6238ad5ff00</td> </tr> <tr> <td id="file-heartbeat-dvc-file-2019-04-yaml-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-heartbeat-dvc-file-2019-04-yaml-LC5" class="blob-code blob-code-inner js-file-line"> metric: false</td> </tr> <tr> <td id="file-heartbeat-dvc-file-2019-04-yaml-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-heartbeat-dvc-file-2019-04-yaml-LC6" class="blob-code blob-code-inner js-file-line"> path: data.txt</td> </tr> <tr> <td id="file-heartbeat-dvc-file-2019-04-yaml-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-heartbeat-dvc-file-2019-04-yaml-LC7" class="blob-code blob-code-inner js-file-line">wdir: ..</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/SvetaGr/110ae76df929654ec573ea9e4b1e1980/raw/3ccd7b7ab89e1e4246c1d8c83d6051df2379bd6d/heartbeat-dvc-file-2019-04.yaml" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/SvetaGr/110ae76df929654ec573ea9e4b1e1980#file-heartbeat-dvc-file-2019-04-yaml" class="Link--inTextBlock"> heartbeat-dvc-file-2019-04.yaml </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p>The original file is moved to the cache and a link or copy (depending on your filesystem) is created to replace it on your working space:</p> <p></p><div id="gist95752708" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-heartbeat-cache-structure-2019-04-sh" class="file my-2"> <div itemprop="text" class="Box-body p-0 
blob-wrapper data type-shell" style="overflow: auto" tabindex="0" role="region" aria-label="heartbeat-cache-structure-2019-04.sh content, created by SvetaGr on 12:53AM on April 17, 2019."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="heartbeat-cache-structure-2019-04.sh"> <tbody><tr> <td id="file-heartbeat-cache-structure-2019-04-sh-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-heartbeat-cache-structure-2019-04-sh-LC1" class="blob-code blob-code-inner js-file-line">.dvc/cache</td> </tr> <tr> <td id="file-heartbeat-cache-structure-2019-04-sh-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-heartbeat-cache-structure-2019-04-sh-LC2" class="blob-code blob-code-inner js-file-line">└── d3</td> </tr> <tr> <td id="file-heartbeat-cache-structure-2019-04-sh-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-heartbeat-cache-structure-2019-04-sh-LC3" class="blob-code blob-code-inner js-file-line"> └── b07384d113edec49eaa6238ad5ff00</td> </tr> </tbody></table> </div> </div> </div> 
</div> </div> <div class="gist-meta"> <a href="https://gist.github.com/SvetaGr/133cb93e5a21c6f21a86f8709ed39ea9/raw/540aa50da9bb891da01030a8877688b74eecc20e/heartbeat-cache-structure-2019-04.sh" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/SvetaGr/133cb93e5a21c6f21a86f8709ed39ea9#file-heartbeat-cache-structure-2019-04-sh" class="Link--inTextBlock"> heartbeat-cache-structure-2019-04.sh </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <h3 id="q-is-it-possible-to-integrate-dvc-with-our-in-house-tools-developed-in-python" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/553570391000481802" target="_blank" rel="nofollow noopener noreferrer">Is it possible to integrate dvc with our in-house tools developed in Python?</a><a href="#q-is-it-possible-to-integrate-dvc-with-our-in-house-tools-developed-in-python" aria-label="q is it possible to integrate dvc with our in house tools developed in python permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Absolutely! 
There are three ways you could interact with DVC:</p> <ol> <li> <p>Use <a href="https://docs.python.org/3/library/subprocess.html" target="_blank" rel="nofollow noopener noreferrer">subprocess</a> to launch DVC.</p> </li> <li> <p>Use <code>from dvc.main import main</code> and call it with the regular CLI arguments passed as a list, like <code>ret = main(['add', 'foo'])</code>.</p> </li> <li> <p>Use our internal API (see <code>dvc/repo</code> and <code>dvc/command</code> in our source to get a grasp of it). It is not officially public yet, and we don’t have any special docs for it, but it is fairly stable and could definitely be used for a POC. We’ll add docs and all the official stuff for it in the not-so-distant future.</p> </li> </ol> <h3 id="q-can-i-still-track-the-linkage-between-data-and-model-without-using-dvc-run-and-a-graph-of-tasks-basically-what-would-like-extremely-minimal-dvc-invasion-into-my-git-repo-for-an-existing-machine-learning-application" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/555750217522216990" target="_blank" rel="nofollow noopener noreferrer">Can I still track the linkage between data and model without using <code>dvc run</code></a> and a graph of tasks? 
Basically, I would like extremely minimal DVC invasion into my Git repo for an existing machine learning application.<a href="#q-can-i-still-track-the-linkage-between-data-and-model-without-using-dvc-run-and-a-graph-of-tasks-basically-what-would-like-extremely-minimal-dvc-invasion-into-my-git-repo-for-an-existing-machine-learning-application" aria-label="q can i still track the linkage between data and model without using dvc run and a graph of tasks basically what would like extremely minimal dvc invasion into my git repo for an existing machine learning application permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There are two options:</p> <ol> <li> <p>Use <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> to track models and/or input datasets. It should be enough to <code>git commit</code> the DVC files produced by <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a>. This is the very minimum you can get with DVC, and it does not require using <code>dvc run</code>. Check the first part (up to the Pipelines/Add transformations section) of the DVC <a href="https://dvc.org/doc/get-started" target="_blank" rel="nofollow noopener noreferrer">get started</a> guide.</p> </li> <li> <p>You could use <code>--no-exec</code> in <code>dvc run</code> and then just <a href="https://dvc.org/doc/command-reference/commit"><code>dvc commit</code></a> and <code>git commit</code> the results. 
That way you’ll get your DVC files with all the linkages, without having to actually run your commands through DVC.</p> </li> </ol> <p>If you have any questions, concerns or ideas, let us know <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> and our stellar team will get back to you in no time.</p>https://dvc.org/blog/march-19-dvc-heartbeathttps://dvc.org/blog/march-19-dvc-heartbeatTue, 05 Mar 2019 00:00:00 GMT<p>This is the very first issue of the DVC❤️Heartbeat. Every month we will be sharing our news, findings, interesting reads, community takeaways, and everything along the way.</p> <p>Some of those are related to our brainchild <a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a> and its journey. The others are a collection of exciting stories and ideas centered around ML best practices and workflow.</p> <h2 id="news-and-links" style="position:relative;">News and links<a href="#news-and-links" aria-label="news and links permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We read a ton of articles and posts every day and here are a few that caught our eye. 
Well-written, offering a different perspective and definitely worth checking.</p> <ul> <li><strong><a href="https://veekaybee.github.io/2019/02/13/data-science-is-different/" target="_blank" rel="nofollow noopener noreferrer">Data science is different now</a> by <a href="https://veekaybee.github.io/" target="_blank" rel="nofollow noopener noreferrer">Vicki Boykis</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://veekaybee.github.io/2019/02/13/data-science-is-different/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data science is different now</h4> <div class="elp-description">Woman holding a balance, Vermeer 1664 What do you think of when you read the phrase 'data science'? It's probably some…</div> <div class="elp-link">veekaybee.github.io</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-03-05/data-science-is-different-now-ef77fccb7554382d75f7471a2564633f.png" alt="Data science is different now"> </div> </a> </section> <p></p> <blockquote> <p>What is becoming clear is that, in the late stage of the hype cycle, data science is asymptotically moving closer to engineering, and the <a href="https://www.youtube.com/watch?v=frQeK8xo9Ls" target="_blank" rel="nofollow noopener noreferrer">skills that data scientists need</a> moving forward are less visualization and statistics-based, and <a href="https://tech.trivago.com/2018/12/03/teardown-rebuild-migrating-from-hive-to-pyspark/" target="_blank" rel="nofollow noopener noreferrer">more in line with traditional computer science curricula</a>.</p> </blockquote> <ul> <li><strong><a href="https://emilygorcenski.com/post/data-versioning/" target="_blank" rel="nofollow noopener noreferrer">Data Versioning</a> by <a href="https://emilygorcenski.com/" target="_blank" rel="nofollow noopener noreferrer">Emily F. 
Gorcenski</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://emilygorcenski.com/post/data-versioning/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Data Versioning</h4> <div class="elp-description">Productionizing machine learning/AI/data science is a challenge. Not only are the outputs of machine-learning…</div> <div class="elp-link">emilygorcenski.com</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-03-05/data-versioning-44da0cbe3c804f68cee118e39b9ac318.jpeg" alt="Data Versioning"> </div> </a> </section> <p></p> <blockquote> <p>I want to explore how the degrees of freedom in versioning machine learning systems poses a unique challenge. I’ll identify four key axes on which machine learning systems have a notion of version, along with some brief recommendations for how to simplify this a bit.</p> </blockquote> <ul> <li><strong><a href="https://blog.mi.hdm-stuttgart.de/index.php/2019/02/26/reproducibility-in-ml/" target="_blank" rel="nofollow noopener noreferrer">Reproducibility in Machine Learning</a> by <a href="https://blog.mi.hdm-stuttgart.de/index.php/author/pf023/" target="_blank" rel="nofollow noopener noreferrer">Pascal Fecht</a></strong></li> </ul> <p> </p><section class="elp-content-holder"> <a href="https://blog.mi.hdm-stuttgart.de/index.php/2019/02/26/reproducibility-in-ml/" class="external-link-preview" target="_blank" rel="noopener noreferrer"> <div class="elp-description-holder"> <h4 class="elp-title">Reproducibility in Machine Learning | Computer Science Blog</h4> <div class="elp-description">The rise of Machine Learning has led to changes across all areas of computer science. 
From a very abstract point of…</div> <div class="elp-link">blog.mi.hdm-stuttgart.de</div> </div> <div class="elp-image-holder"> <img src="https://dvc.org/2019-03-05/reproducibility-in-machine-learning-4fa14e52fb2fa408a0b6870280e31566.jpeg" alt="Reproducibility in Machine Learning | Computer Science Blog"> </div> </a> </section> <p></p> <blockquote> <p>…the objective of this post is not to philosophize about the dangers and dark sides of AI. In fact, this post aims to work out common challenges in reproducibility for machine learning and shows programming differences to other areas of Computer Science. Secondly, we will see practices and workflows to create a higher grade of reproducibility in machine learning algorithms.</p> </blockquote> <hr> <h2 id="discord-gems" style="position:relative;">Discord gems<a href="#discord-gems" aria-label="discord gems permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>There are lots of hidden gems in our Discord community discussions. 
Sometimes they are scattered all over the channels and hard to track down.</p> <p>We will be sifting through the issues and discussions and sharing the most interesting takeaways.</p> <h3 id="q-edit-and-define-dvc-files-manually-in-a-makefile-style" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/541622187296161816" target="_blank" rel="nofollow noopener noreferrer">Edit and define DVC files manually, in a Makefile style</a><a href="#q-edit-and-define-dvc-files-manually-in-a-makefile-style" aria-label="q edit and define dvc files manually in a makefile style permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There is no separate guide for that, but it is very straightforward. See the <a href="https://dvc.org/doc/user-guide/project-structure" target="_blank" rel="nofollow noopener noreferrer">DVC file format</a> description for how a DVC file looks inside in general. All <a href="https://dvc.org/doc/command-reference/add"><code>dvc add</code></a> or <code>dvc run</code> does is compute the <code>md5</code> fields in it, that is all. 
You could write your DVC-file by hand and then run <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>, which will run the command (if any) and compute all needed checksums; <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/541622187296161816" target="_blank" rel="nofollow noopener noreferrer">read more</a>.</p> <h3 id="q-best-practices-to-define-the-code-dependencies" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/547424240677158915" target="_blank" rel="nofollow noopener noreferrer">Best practices to define the code dependencies</a><a href="#q-best-practices-to-define-the-code-dependencies" aria-label="q best practices to define the code dependencies permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>There’s a ton of code in that project, and it’s very non-trivial to define the code dependencies for my training stage: there are a lot of imports going on, and the training code is distributed across many modules; <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/547424240677158915" target="_blank" rel="nofollow noopener noreferrer">read more</a>.</p> <h3 id="q-azure-data-lake-support" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/548495589428428801" target="_blank" rel="nofollow noopener noreferrer">Azure data lake support</a><a href="#q-azure-data-lake-support" aria-label="q azure data lake support permalink" class="anchor after"><svg aria-hidden="true" 
height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC officially only supports regular Azure blob storage. Gen1 Data Lake should be accessible by the same interface, so configuring a regular Azure remote for DVC should work. It seems that Gen2 Data Lake <a href="https://discordapp.com/channels/485586884165107732/485586884165107734/550546413197590539" target="_blank" rel="nofollow noopener noreferrer">has disabled</a> the blob API. If you know more details about the difference between Gen1 and Gen2, feel free to join <a href="https://dvc.org/chat" target="_blank" rel="nofollow noopener noreferrer">our community</a> and share this knowledge.</p> <h3 id="q-what-licence-dvc-is-released-under" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/542390986299539459" target="_blank" rel="nofollow noopener noreferrer">What licence DVC is released under</a><a href="#q-what-licence-dvc-is-released-under" aria-label="q what licence dvc is released under permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Apache 2.0. 
One of the <a href="https://opensource.org/licenses" target="_blank" rel="nofollow noopener noreferrer">most common</a> and permissive OSS licences.</p> <h3 id="q-setting-up-s3-compatible-remote" style="position:relative;">Q: Setting up S3 compatible remote<a href="#q-setting-up-s3-compatible-remote" aria-label="q setting up s3 compatible remote permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>(<a href="https://discordapp.com/channels/485586884165107732/485596304961962003/543445798868746278" target="_blank" rel="nofollow noopener noreferrer">Localstack</a>, <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/541466951474479115" target="_blank" rel="nofollow noopener noreferrer">Wasabi</a>)</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote add</span> upstream s3://my-bucket </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> upstream region REGION_NAME </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc remote modify</span> upstream endpointurl <span class="token operator">&lt;</span>url<span class="token operator">&gt;</span></span></code></pre></div> <p>Find and click the <code>S3 API compatible storage</code> section on <a href="https://dvc.org/doc/command-reference/remote/add" target="_blank" rel="nofollow noopener noreferrer">this page</a>.</p> <h3 
id="q-why-dvc-creates-and-updates-gitignore-file" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/543914550173368332" target="_blank" rel="nofollow noopener noreferrer">Why does DVC create and update the <code>.gitignore</code> file?</a><a href="#q-why-dvc-creates-and-updates-gitignore-file" aria-label="q why dvc creates and updates gitignore file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>DVC adds the data files it tracks to <code>.gitignore</code> so that you don’t accidentally add them to Git as well. You can open the file with an editor of your choice and see your data files listed there.</p> <h3 id="q-managing-data-and-pipelines-with-dvc-on-hdfs" style="position:relative;">Q: <a href="https://discordapp.com/channels/485586884165107732/485596304961962003/545562334983356426" target="_blank" rel="nofollow noopener noreferrer">Managing data and pipelines with DVC on HDFS</a><a href="#q-managing-data-and-pipelines-with-dvc-on-hdfs" aria-label="q managing data and pipelines with dvc on hdfs permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>With DVC, you could connect your data sources from 
HDFS with your pipeline in your local project, by simply specifying it as an external dependency. For example let’s say your script <code>process.cmd</code> works on an input file on HDFS and then downloads a result to your local workspace, then with DVC it could look something like:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token parameter variable">-d</span> hdfs://example.com/home/shared/input <span class="token punctuation">\</span> <span class="token parameter variable">-d</span> process.cmd <span class="token punctuation">\</span> <span class="token parameter variable">-o</span> output process.cmd</span></code></pre></div> <p><a href="https://discordapp.com/channels/485586884165107732/485596304961962003/545562334983356426" target="_blank" rel="nofollow noopener noreferrer">read more</a>.</p> <hr> <p>If you have any questions, concerns or ideas, let us know <a href="https://dvc.org/support" target="_blank" rel="nofollow noopener noreferrer">here</a> and our stellar team will get back to you in no time.</p>https://dvc.org/blog/ml-best-practices-in-pytorch-dev-conf-2018https://dvc.org/blog/ml-best-practices-in-pytorch-dev-conf-2018Thu, 18 Oct 2018 00:00:00 GMT<p>The issues discussed included applying traditional software development techniques like unit testing, CI/CD systems, automated deployment, version control, and more to the ML field. In this blog post, we will go over the best practices ideas from PTDC-18 and the future of ML tool developments.</p> <h2 id="1-engineering-practices-from-pytorch-developers" style="position:relative;">1. 
Engineering practices from PyTorch developers<a href="#1-engineering-practices-from-pytorch-developers" aria-label="1 engineering practices from pytorch developers permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In the PTDC-18 <a href="https://www.facebook.com/pytorch/videos/482401942168584/" target="_blank" rel="nofollow noopener noreferrer">keynote speech</a>, <strong>Jerome Pesenti</strong> described the motivation and goals of the PyTorch project and what the future of machine learning looks like.</p> <h3 id="11-ml-tooling-future" style="position:relative;">1.1. ML tooling future<a href="#11-ml-tooling-future" aria-label="11 ml tooling future permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Regarding the future of ML, Jerome envisioned “streamlined development, more accessible tools, breakthrough hardware, and more”. Talking about the huge gap between software engineering and ML engineering, Pesenti said:</p> <blockquote> <p>Machine learning engineering is where we were in Software Engineering 20 years ago. A lot of things still need to be invented. 
We need to figure out what testing means, what CD (continuous delivery) means, we need to develop tools and environments that people can develop <strong>robust ML that does not have too many biases</strong> and does not overfit.</p> </blockquote> <p>In that gap live many opportunities to develop new tools and services. We in the ML ecosystem are called upon to implement the future of machine learning tools. Traditional software engineering has many useful tools and techniques which can either be repurposed for machine learning development or used as a source of ideas for developing new tools.</p> <h3 id="12-pytorch-motivation" style="position:relative;">1.2. PyTorch motivation<a href="#12-pytorch-motivation" aria-label="12 pytorch motivation permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>PyTorch 1.0 implements one important engineering principle: “a seamless transition from AI research to production”. It helps to move AI technology from research into production as quickly as possible. In order to do that, a few challenges had to be solved:</p> <ol> <li> <p><strong>Write code once</strong>: no need to rewrite or re-optimize code to go from research to production.</p> </li> <li> <p><strong>Performance</strong>: training models on large datasets.</p> </li> <li> <p><strong>Other languages</strong>: not only Python, which is great for prototyping, but also C++ and other languages.</p> </li> <li> <p><strong>Scaling</strong>: deploying PyTorch at scale more easily.</p> </li> </ol> <h2 id="2-engineering-practices-for-software-20" style="position:relative;">2. 
Engineering practices for software 2.0<a href="#2-engineering-practices-for-software-20" aria-label="2 engineering practices for software 20 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <h3 id="21-melting-of-software-20-and-software-10" style="position:relative;">2.1. Melting of software 2.0 and software 1.0<a href="#21-melting-of-software-20-and-software-10" aria-label="21 melting of software 20 and software 10 permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p><strong>Andrej Karpathy</strong> from Tesla AI had a <a href="https://www.facebook.com/pytorch/videos/169366590639145/" target="_blank" rel="nofollow noopener noreferrer">dedicated talk</a> about best engineering practices in ML. 
He drew a contrast between traditional software development (software 1.0) and software utilizing machine learning techniques (software 2.0), saying that</p> <blockquote> <p>“software 2.0 code also has new feature demands, contains bugs, and requires iterations.”</p> </blockquote> <p>In other words, ML development has a lifecycle similar to that of traditional software:</p> <blockquote> <p>“When you are working with these [neural] networks <strong>in production</strong> you are doing much more than that [training and measuring models]. You [are] maintaining the codebase and that codebase is alive is just like 1.0 code.”</p> </blockquote> <p>Machine learning models need to grow and develop feature by feature, bugs need to be found and fixed, and repeatable processes are a must, as in earlier non-ML software development practices.</p> <h3 id="22-software-20-best-practices" style="position:relative;">2.2. Software 2.0 best practices<a href="#22-software-20-best-practices" aria-label="22 software 20 best practices permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>Karpathy went on to describe how software 1.0 best practices can be used in software 2.0 (ML modeling):</p> <ol> <li> <p><strong>Test-driven development</strong>: test/train dataset separation is not enough since it describes only expected performance. Edge cases have to be tested to ensure the model performs as required. 
That requires incorporating more examples in datasets, changing the model architecture, or changing optimization functions.</p> </li> <li> <p><strong>Continuous Integration and Continuous Delivery</strong> (CI/CD): intelligent use of CI/CD can propel a team into rapid agile development of software systems. The phases of CI/CD jobs include: 1) automatic ML model re-training when code or datasets change; 2) running unit tests; 3) easy access to the latest model; 4) auto-deployment to test and/or production systems.</p> </li> <li> <p><strong>Version Control</strong>: track all the changes in datasets (labels), not only code.</p> </li> <li> <p>Train a <strong>single model</strong> from scratch every time without using other pre-trained models. (External pre-trained models don’t count as far as I understand.) A chain of fine-tuned models very quickly disintegrates the codebase. In software 1.0, a single <strong>monorepo</strong> is an analog of a single model, which also helps to avoid disintegration.</p> </li> </ol> <p>This list of best practices shows how serious Tesla AI is about robust software, which is not surprising in the self-driving car domain. Any company needs these practices in order to organize a manageable ML development process.</p> <h2 id="3-data-file-centric-tools" style="position:relative;">3. 
Data file-centric tools<a href="#3-data-file-centric-tools" aria-label="3 data file centric tools permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Frameworks and libraries like PyTorch are a significant step forward in machine learning tooling and in bringing best practices to the field. However, frameworks and libraries might not be enough for many of the ML best practices. For example, dataset versioning, ML model versioning, continuous integration (CI) and continuous delivery (CD) require manipulating and transferring data files. These can be done in a <strong>more efficient and natural way by data management tools</strong> and storage systems rather than libraries.</p> <p>The need for a machine learning artifact manipulation tool with a <strong>data file-centric philosophy</strong> was the major motivation behind the open source project that we created: Data Version Control (DVC), or <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC.org</a>.</p> <p>DVC connects Git with data files and machine learning pipelines, which helps keep machine learning models and datasets under version control using familiar Git semantics coupled with the power of cloud storage systems such as Amazon’s S3, Google’s GCS, Microsoft’s Azure, or bare-metal servers accessed by SSH.</p> <p>If PyTorch helps in organizing code inside an ML project, then data-centric tools like DVC help organize the different pieces of ML projects into a single workflow. 
The future of machine learning requires both types of tools: code-level and data file-level.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>So far, only the first steps have been taken toward adopting machine learning tooling and best practices. It is mostly large companies that use these practices, because they faced the problems a while ago. Best practices should be embraced by the entire industry, which will help bring machine learning to a new level.</p>https://dvc.org/blog/best-practices-of-orchestrating-python-and-r-code-in-ml-projectshttps://dvc.org/blog/best-practices-of-orchestrating-python-and-r-code-in-ml-projectsTue, 26 Sep 2017 00:00:00 GMT<p>Besides Git and shell scripting, additional tools have been developed to facilitate predictive model development in multi-language environments. For fast data exchange between R and Python, let’s use the binary data file format <a href="https://blog.rstudio.com/2016/03/29/feather/" target="_blank" rel="nofollow noopener noreferrer">Feather</a>. 
Another language-agnostic tool, <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, can make the research reproducible, so let’s use DVC to orchestrate R and Python code instead of regular shell scripts.</p> <h2 id="machine-learning-with-r-and-python" style="position:relative;">Machine learning with R and Python<a href="#machine-learning-with-r-and-python" aria-label="machine learning with r and python permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Both R and Python have powerful libraries and packages for predictive modeling. Algorithms for classification or regression are usually implemented in both languages; some scientists use R while others prefer Python. In the example explained in the previous <a href="https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b" target="_blank" rel="nofollow noopener noreferrer">tutorial</a>, the target variable was a binary output and logistic regression was used as the training algorithm. Another algorithm that could be used for prediction is the popular <a href="https://en.wikipedia.org/wiki/Random_forest" target="_blank" rel="nofollow noopener noreferrer">Random Forest algorithm</a>, which is implemented in both programming languages. 
For performance reasons, we decided to implement the Random Forest classifier in Python (it performed better than the random forest package in R).</p> <h2 id="r-example-used-for-dvc-demo" style="position:relative;">R example used for DVC demo<a href="#r-example-used-for-dvc-demo" aria-label="r example used for dvc demo permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We will use the same example as in the previous blog <a href="https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b" target="_blank" rel="nofollow noopener noreferrer">story</a>, add some Python code, and explain how Feather and DVC can simplify the development process in this combined environment.</p> <p>Let’s briefly recall the R code from the previous tutorial:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 335px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/68824bc8c4ac0c84edf737da9f1bfa01/31682/r-jobs.png" alt="R Jobs" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>R Jobs</em></p> <p>The input data is an XML file of StackOverflow posts. Predictive variables are created from the text of the posts: the relative importance (<a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf" target="_blank" rel="nofollow noopener noreferrer">tf-idf</a>) of words across all available posts is calculated. The target is then predicted from the tf-idf matrices, using lasso logistic regression for the binary output. 
AUC is calculated on the test set and AUC metric is used on evaluation.</p> <p>Instead of using logistic regression in R we will write Python jobs in which we will try to use random forest as training model. Train_model.R and evaluate.R will be replaced with appropriate Python jobs.</p> <p>R codes can be seen <a href="https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b" target="_blank" rel="nofollow noopener noreferrer">here</a>.</p> <p>Code for <code>train_model_Python.py</code> is presented below:</p> <p></p><div id="gist73527556" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-train_model_python-py" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python" style="overflow: auto" tabindex="0" role="region" aria-label="train_model_Python.py content, created by Zoldin on 06:52AM on August 02, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="train_model_Python.py"> <tbody><tr> <td id="file-train_model_python-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-train_model_python-py-LC1" class="blob-code blob-code-inner js-file-line">import numpy as np</td> </tr> <tr> <td id="file-train_model_python-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-train_model_python-py-LC2" class="blob-code blob-code-inner js-file-line">from sklearn.ensemble import RandomForestClassifier</td> </tr> <tr> <td id="file-train_model_python-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-train_model_python-py-LC3" class="blob-code blob-code-inner js-file-line">import sys</td> </tr> <tr> <td id="file-train_model_python-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td 
id="file-train_model_python-py-LC4" class="blob-code blob-code-inner js-file-line">try: import cPickle as pickle # python2</td> </tr> <tr> <td id="file-train_model_python-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-train_model_python-py-LC5" class="blob-code blob-code-inner js-file-line">except: import pickle # python3</td> </tr> <tr> <td id="file-train_model_python-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-train_model_python-py-LC6" class="blob-code blob-code-inner js-file-line">from scipy import sparse</td> </tr> <tr> <td id="file-train_model_python-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-train_model_python-py-LC7" class="blob-code blob-code-inner js-file-line">from numpy import loadtxt</td> </tr> <tr> <td id="file-train_model_python-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-train_model_python-py-LC8" class="blob-code blob-code-inner js-file-line">import feather as ft</td> </tr> <tr> <td id="file-train_model_python-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-train_model_python-py-LC9" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model_python-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-train_model_python-py-LC10" class="blob-code blob-code-inner js-file-line">if len(sys.argv) != 4:</td> </tr> <tr> <td id="file-train_model_python-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-train_model_python-py-LC11" class="blob-code blob-code-inner js-file-line"> sys.stderr.write('Arguments error. 
Usage:\n')</td> </tr> <tr> <td id="file-train_model_python-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-train_model_python-py-LC12" class="blob-code blob-code-inner js-file-line"> sys.stderr.write('\tpython train_model.py INPUT_MATRIX_FILE SEED OUTPUT_MODEL_FILE\n')</td> </tr> <tr> <td id="file-train_model_python-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td id="file-train_model_python-py-LC13" class="blob-code blob-code-inner js-file-line"> sys.exit(1)</td> </tr> <tr> <td id="file-train_model_python-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-train_model_python-py-LC14" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model_python-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-train_model_python-py-LC15" class="blob-code blob-code-inner js-file-line">input = sys.argv[1]</td> </tr> <tr> <td id="file-train_model_python-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-train_model_python-py-LC16" class="blob-code blob-code-inner js-file-line">seed = int(sys.argv[2])</td> </tr> <tr> <td id="file-train_model_python-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-train_model_python-py-LC17" class="blob-code blob-code-inner js-file-line">output = sys.argv[3]</td> </tr> <tr> <td id="file-train_model_python-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-train_model_python-py-LC18" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model_python-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-train_model_python-py-LC19" class="blob-code blob-code-inner js-file-line">df = ft.read_dataframe(input)</td> </tr> <tr> <td 
id="file-train_model_python-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td> <td id="file-train_model_python-py-LC20" class="blob-code blob-code-inner js-file-line">labels = df.loc[:,'label']</td> </tr> <tr> <td id="file-train_model_python-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td> <td id="file-train_model_python-py-LC21" class="blob-code blob-code-inner js-file-line">x = df.loc[:, df.columns != 'label']</td> </tr> <tr> <td id="file-train_model_python-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td> <td id="file-train_model_python-py-LC22" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model_python-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-train_model_python-py-LC23" class="blob-code blob-code-inner js-file-line">clf = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=seed)</td> </tr> <tr> <td id="file-train_model_python-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td> <td id="file-train_model_python-py-LC24" class="blob-code blob-code-inner js-file-line">clf.fit(x, labels.ix[:,0])</td> </tr> <tr> <td id="file-train_model_python-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td> <td id="file-train_model_python-py-LC25" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model_python-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td> <td id="file-train_model_python-py-LC26" class="blob-code blob-code-inner js-file-line">with open(output, 'wb') as fd:</td> </tr> <tr> <td id="file-train_model_python-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td> <td id="file-train_model_python-py-LC27" class="blob-code blob-code-inner js-file-line"> pickle.dump(clf, fd)</td> </tr> </tbody></table> </div> </div> </div> </div> 
</div> <div class="gist-meta"> <a href="https://gist.github.com/Zoldin/b312897cc492608feef1eaeae7f6eabc/raw/8dad0f69067945b9b84f8d90a8cdbe52694e36f8/train_model_Python.py" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/Zoldin/b312897cc492608feef1eaeae7f6eabc#file-train_model_python-py" class="Link--inTextBlock"> train_model_Python.py </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p>Also here we are adding code for <code>evaluation_python_model.py</code>:</p> <p></p><div id="gist73527649" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-evaluation_python_model-py" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python" style="overflow: auto" tabindex="0" role="region" aria-label="evaluation_python_model.py content, created by Zoldin on 06:54AM on August 02, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="evaluation_python_model.py"> <tbody><tr> <td id="file-evaluation_python_model-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-evaluation_python_model-py-LC1" class="blob-code blob-code-inner js-file-line">from sklearn.metrics import precision_recall_curve</td> </tr> <tr> <td id="file-evaluation_python_model-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-evaluation_python_model-py-LC2" class="blob-code blob-code-inner js-file-line">import sys</td> </tr> <tr> <td id="file-evaluation_python_model-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-evaluation_python_model-py-LC3" class="blob-code blob-code-inner js-file-line">import sklearn.metrics as metrics</td> </tr> <tr> <td id="file-evaluation_python_model-py-L4" class="blob-num 
js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-evaluation_python_model-py-LC4" class="blob-code blob-code-inner js-file-line">from scipy import sparse</td> </tr> <tr> <td id="file-evaluation_python_model-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-evaluation_python_model-py-LC5" class="blob-code blob-code-inner js-file-line">from numpy import loadtxt</td> </tr> <tr> <td id="file-evaluation_python_model-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-evaluation_python_model-py-LC6" class="blob-code blob-code-inner js-file-line">try: import cPickle as pickle # python2</td> </tr> <tr> <td id="file-evaluation_python_model-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-evaluation_python_model-py-LC7" class="blob-code blob-code-inner js-file-line">except: import pickle # python3</td> </tr> <tr> <td id="file-evaluation_python_model-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-evaluation_python_model-py-LC8" class="blob-code blob-code-inner js-file-line">import feather as ft</td> </tr> <tr> <td id="file-evaluation_python_model-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-evaluation_python_model-py-LC9" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluation_python_model-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-evaluation_python_model-py-LC10" class="blob-code blob-code-inner js-file-line">if len(sys.argv) != 4:</td> </tr> <tr> <td id="file-evaluation_python_model-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-evaluation_python_model-py-LC11" class="blob-code blob-code-inner js-file-line"> sys.stderr.write('Arguments error. 
Usage:\n')</td> </tr> <tr> <td id="file-evaluation_python_model-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-evaluation_python_model-py-LC12" class="blob-code blob-code-inner js-file-line"> sys.stderr.write('\tpython metrics.py MODEL_FILE TEST_MATRIX METRICS_FILE\n')</td> </tr> <tr> <td id="file-evaluation_python_model-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td id="file-evaluation_python_model-py-LC13" class="blob-code blob-code-inner js-file-line"> sys.exit(1)</td> </tr> <tr> <td id="file-evaluation_python_model-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-evaluation_python_model-py-LC14" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluation_python_model-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-evaluation_python_model-py-LC15" class="blob-code blob-code-inner js-file-line">model_file = sys.argv[1]</td> </tr> <tr> <td id="file-evaluation_python_model-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-evaluation_python_model-py-LC16" class="blob-code blob-code-inner js-file-line">test_matrix_file = sys.argv[2]</td> </tr> <tr> <td id="file-evaluation_python_model-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-evaluation_python_model-py-LC17" class="blob-code blob-code-inner js-file-line">metrics_file = sys.argv[3]</td> </tr> <tr> <td id="file-evaluation_python_model-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-evaluation_python_model-py-LC18" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluation_python_model-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-evaluation_python_model-py-LC19" class="blob-code blob-code-inner 
js-file-line">with open(model_file, 'rb') as fd:</td> </tr> <tr> <td id="file-evaluation_python_model-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td> <td id="file-evaluation_python_model-py-LC20" class="blob-code blob-code-inner js-file-line"> model = pickle.load(fd)</td> </tr> <tr> <td id="file-evaluation_python_model-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td> <td id="file-evaluation_python_model-py-LC21" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluation_python_model-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td> <td id="file-evaluation_python_model-py-LC22" class="blob-code blob-code-inner js-file-line">df = ft.read_dataframe(test_matrix_file)</td> </tr> <tr> <td id="file-evaluation_python_model-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-evaluation_python_model-py-LC23" class="blob-code blob-code-inner js-file-line">labels = df.loc[:,'label']</td> </tr> <tr> <td id="file-evaluation_python_model-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td> <td id="file-evaluation_python_model-py-LC24" class="blob-code blob-code-inner js-file-line">x = df.loc[:, df.columns != 'label']</td> </tr> <tr> <td id="file-evaluation_python_model-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td> <td id="file-evaluation_python_model-py-LC25" class="blob-code blob-code-inner js-file-line">predictions_by_class = model.predict_proba(x)</td> </tr> <tr> <td id="file-evaluation_python_model-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td> <td id="file-evaluation_python_model-py-LC26" class="blob-code blob-code-inner js-file-line">predictions = predictions_by_class[:,1]</td> </tr> <tr> <td id="file-evaluation_python_model-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td> <td 
id="file-evaluation_python_model-py-LC27" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluation_python_model-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td> <td id="file-evaluation_python_model-py-LC28" class="blob-code blob-code-inner js-file-line">precision, recall, thresholds = precision_recall_curve(labels.ix[:,0], predictions)</td> </tr> <tr> <td id="file-evaluation_python_model-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td> <td id="file-evaluation_python_model-py-LC29" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluation_python_model-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td> <td id="file-evaluation_python_model-py-LC30" class="blob-code blob-code-inner js-file-line">auc = metrics.auc(recall, precision)</td> </tr> <tr> <td id="file-evaluation_python_model-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td> <td id="file-evaluation_python_model-py-LC31" class="blob-code blob-code-inner js-file-line">#print('AUC={}'.format(metrics.auc(recall, precision)))</td> </tr> <tr> <td id="file-evaluation_python_model-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td> <td id="file-evaluation_python_model-py-LC32" class="blob-code blob-code-inner js-file-line">with open(metrics_file, 'w') as fd:</td> </tr> <tr> <td id="file-evaluation_python_model-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td> <td id="file-evaluation_python_model-py-LC33" class="blob-code blob-code-inner js-file-line"> fd.write('AUC: {:4f}\n'.format(auc))</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/Zoldin/9eef13632d0a9039fe9b0dba376516a4/raw/8b8837f0d5640e0c208ea1c4910d655d933b9bd0/evaluation_python_model.py" style="float:right" class="Link--inTextBlock">view raw</a> 
<a href="https://gist.github.com/Zoldin/9eef13632d0a9039fe9b0dba376516a4#file-evaluation_python_model-py" class="Link--inTextBlock"> evaluation_python_model.py </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p>Let’s download the necessary R and Python code from above (clone the <a href="https://github.com/Zoldin/R_AND_DVC" target="_blank" rel="nofollow noopener noreferrer">GitHub</a> repository):</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">mkdir</span> R_DVC_GITHUB_CODE </span><span class="token line"><span class="token input">$ </span><span class="token command">cd</span> R_DVC_GITHUB_CODE </span> <span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> https://github.com/Zoldin/R_AND_DVC</span></code></pre></div> <p>The dependency graph of this data science project looks like this:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 250.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fbd7192868b16c9a421107083e2dd45b/09eb0/our-dependency-graph.png" alt="R (marked red) and Python (marked pink) jobs in one project" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>R (marked red) and Python (marked pink) jobs in one project</em></p> <p>Now let’s see how the Feather API and DVC reproducibility can speed up and simplify the process flow.</p> <h2 id="feather-api" style="position:relative;">Feather API<a href="#feather-api" aria-label="feather api permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The Feather API is designed to improve the interchange of data and metadata between R and Python. It provides fast import and export of data frames between the two environments and preserves metadata, which is an improvement over exchanging data via csv/txt files. In our example, a Python job will read a binary input file that was produced in R with the Feather API.</p> <p>Let’s install the Feather library in both environments.</p> <p>For Python 3 on Linux, you can use the command line and pip3:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">sudo</span> pip3 <span class="token function">install</span> feather-format</span></code></pre></div> <p>For R, it is necessary to install the feather package:</p> <div class="gatsby-highlight" data-language="r"><pre class="language-r"><code class="language-r">install.packages<span class="token punctuation">(</span><span class="token string">"feather"</span><span class="token punctuation">)</span></code></pre></div> <p>After successful installation, we can use Feather for data exchange.</p> <p>Below is the R syntax for exporting data frames with Feather (featurization.R):</p> <div class="gatsby-highlight" data-language="r"><pre class="language-r"><code class="language-r">library<span class="token punctuation">(</span>feather<span class="token punctuation">)</span> write_feather<span class="token punctuation">(</span>dtm_train_tfidf<span class="token punctuation">,</span>args<span class="token punctuation">[</span><span class="token number">3</span><span class="token punctuation">]</span><span class="token punctuation">)</span> write_feather<span class="token 
punctuation">(</span>dtm_test_tfidf<span class="token punctuation">,</span>args<span class="token punctuation">[</span><span class="token number">4</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
print<span class="token punctuation">(</span><span class="token string">"Two data frame were created with Feather - one for train and one for test data set"</span><span class="token punctuation">)</span></code></pre></div> <p>Python code for reading the Feather binary input file (train_model_python.py):</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"><span class="token keyword">import</span> sys
<span class="token keyword">import</span> feather <span class="token keyword">as</span> ft

<span class="token comment"># Path of the Feather file produced by featurization.R</span>
input_path <span class="token operator">=</span> sys<span class="token punctuation">.</span>argv<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span>
df <span class="token operator">=</span> ft<span class="token punctuation">.</span>read_dataframe<span class="token punctuation">(</span>input_path<span class="token punctuation">)</span></code></pre></div> <h2 id="dependency-graph-with-r-and-python-combined" style="position:relative;">Dependency graph with R and Python combined<a href="#dependency-graph-with-r-and-python-combined" aria-label="dependency graph with r and python combined permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The next question we ask ourselves is: why do we need DVC, 
why not just use shell scripting? DVC automatically derives the dependencies between the steps and builds <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph" target="_blank" rel="nofollow noopener noreferrer">the dependency graph (DAG)</a> transparently to the user. The graph is used to reproduce only the parts of your pipeline that were affected by recent changes, so we don’t have to keep track of which steps need to be repeated after each change.</p> <p>First, we use the <code>dvc run</code> command to execute all jobs related to our model development. In this phase, DVC records the dependencies that will be used in the reproducibility phase:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://s3-us-west-2.amazonaws.com/dvc-public/data/tutorial/nlp/25K/Posts.xml.zip <span class="token punctuation">\</span>
    data/
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token function">unzip</span> data/Posts.xml.zip <span class="token parameter variable">-d</span> data/
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/parsingxml.R <span class="token punctuation">\</span>
    data/Posts.xml data/Posts.csv
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_test_spliting.R <span class="token punctuation">\</span>
    data/Posts.csv <span class="token number">0.33</span> <span class="token number">20170426</span> <span class="token punctuation">\</span>
    data/train_post.csv data/test_post.csv
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/featurization.R <span class="token punctuation">\</span> 
data/train_post.csv <span class="token punctuation">\</span>
    data/test_post.csv data/matrix_train.feather <span class="token punctuation">\</span>
    data/matrix_test.feather
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> python3 code/train_model_python.py <span class="token punctuation">\</span>
    data/matrix_train.feather <span class="token punctuation">\</span>
    <span class="token number">20170426</span> data/model.p
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> python3 code/evaluate_python_mdl.py <span class="token punctuation">\</span>
    data/model.p data/matrix_test.feather <span class="token punctuation">\</span>
    data/evaluation_python.txt</span></code></pre></div> <p>After these commands, the jobs are executed and included in the DAG. The result (the AUC metric) is written to the evaluation_python.txt file:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data/evaluation_python.txt
</span>AUC: 0.741432</code></pre></div> <p>We can improve this result by tuning the random forest classifier.</p> <p>For example, we can increase the number of trees in the random forest classifier from 100 to 500:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">clf <span class="token operator">=</span> RandomForestClassifier<span class="token punctuation">(</span>n_estimators<span class="token operator">=</span><span class="token number">500</span><span class="token punctuation">,</span> n_jobs<span class="token operator">=</span><span class="token number">2</span><span class="token punctuation">,</span> random_state<span class="token operator">=</span>seed<span class="token punctuation">)</span>
clf<span class="token punctuation">.</span>fit<span class="token 
punctuation">(</span>x<span class="token punctuation">,</span> labels<span class="token punctuation">)</span></code></pre></div> <p>After committing the change (in <code>train_model_python.py</code>), the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command re-executes all jobs needed to reproduce <code>evaluation_python.txt</code>. We don’t need to worry about which jobs to run or in which order.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token git">git add</span> <span class="token builtin class-name">.</span>
</span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span>
</span>[master a65f346] Random forest classifier - more trees added
 1 file changed, 1 insertion(+), 1 deletion(-)

<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> data/evaluation_python.txt
</span>
Reproducing run command for data item data/model.p. Args: python3 code/train_model_python.py data/matrix_train.txt 20170426 data/model.p
Reproducing run command for data item data/evaluation_python.txt. Args: python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.txt data/evaluation_python.txt
Data item “data/evaluation_python.txt” was reproduced.</code></pre></div> <p>Besides code versioning, DVC also takes care of data versioning. 
For example, if we change the data sets <code>train_post.csv</code> and <code>test_post.csv</code> (by using a different splitting ratio), DVC detects that the data sets have changed, and <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> re-executes all jobs needed to rebuild evaluation_python.txt.</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_test_spliting.R <span class="token punctuation">\</span>
    data/Posts.csv <span class="token number">0.15</span> <span class="token number">20170426</span> <span class="token punctuation">\</span>
    data/train_post.csv <span class="token punctuation">\</span>
    data/test_post.csv</span></code></pre></div> <p>Re-executed jobs are marked in red:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 250.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/10053d985ed8b13cfb9b560ee5d2cc37/09eb0/re-executed-jobs.png" alt="re executed jobs" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_test_spliting.R <span class="token punctuation">\</span>
    data/Posts.csv <span class="token number">0.15</span> <span class="token number">20170426</span> <span class="token punctuation">\</span>
    data/train_post.csv <span class="token punctuation">\</span>
    data/test_post.csv
</span>
<span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> data/evaluation_python.txt
</span>
Reproducing run command for data item data/matrix_train.txt. 
Args: Rscript --vanilla code/featurization.R data/train_post.csv data/test_post.csv data/matrix_train.txt data/matrix_test.txt
Reproducing run command for data item data/model.p. Args: python3 code/train_model_python.py data/matrix_train.txt 20170426 data/model.p
Reproducing run command for data item data/evaluation_python.txt. Args: python3 code/evaluate_python_mdl.py data/model.p data/matrix_test.txt data/evaluation_python.txt
Data item “data/evaluation_python.txt” was reproduced.

<span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data/evaluation_python.txt
</span>AUC: 0.793145</code></pre></div> <p>The new AUC is 0.793145, an improvement over the previous iteration.</p> <h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Data science projects often combine R and Python. Tools beyond Git and shell scripting have been developed to facilitate building predictive models in multi-language environments. Using a data version control system for reproducibility and Feather for data interoperability helps you orchestrate R and Python code in a single environment.</p>https://dvc.org/blog/ml-model-ensembling-with-fast-iterationshttps://dvc.org/blog/ml-model-ensembling-with-fast-iterationsWed, 23 Aug 2017 00:00:00 GMT<p>In a model ensembling setup, the final prediction is a composite of predictions from individual machine learning algorithms. 
To make the best model composite, you have to try dozens of combinations of weights for the model set. It takes a lot of time to come up with the best one. That is why iteration speed is crucial in ML model ensembling. We are going to make our research reproducible by using the <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">Data Version Control</a> tool (<a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>). It provides the ability to quickly re-run and replicate the ML prediction result by executing a single command, <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a>.</p> <p>As we will demonstrate, DVC helps tackle common technical challenges of building pipelines for ensemble learning.</p> <h2 id="project-overview" style="position:relative;">Project Overview<a href="#project-overview" aria-label="project overview permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In this case, we will build an R-based solution to a supervised-learning regression problem: predicting wine sales for the <a href="https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/" target="_blank" rel="nofollow noopener noreferrer">Predict Wine Sales</a> Kaggle competition.</p> <p>An ensemble prediction methodology will be used in the project. 
A weighted ensemble of three models (Linear Regression, <code>GBM</code>, and <code>XGBoost</code>) will be implemented, trained, and used for prediction.</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 435px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/eb9050a712d4a3f7fd006686b1f41fe2/39600/ensemble-prediction-methodology.png" alt="ensemble prediction methodology" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>If properly designed and used, an ensemble prediction can perform much better than the predictions of the individual machine learning models that compose the ensemble.</p> <p>Prediction results will be delivered as an output CSV file in the format specified in the requirements of the <a href="https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/" target="_blank" rel="nofollow noopener noreferrer">Predict Wine Sales</a> Kaggle competition (the so-called Kaggle submission file).</p> <h2 id="important-pre-requisites" style="position:relative;">Important Pre-Requisites<a href="#important-pre-requisites" aria-label="important pre requisites permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To try the materials of this <a href="https://github.com/gvyshnya/DVC_R_Ensemble" target="_blank" rel="nofollow noopener noreferrer">repository</a> in your environment, the following software should be installed on your machine:</p> <ul> <li> <p><strong><em>Python 3</em></strong> runtime 
environment for your OS (it is required to run the DVC commands in the batch files)</p> </li> <li> <p><strong><em>DVC</em></strong> itself (you can install it as a Python package by running the standard command at your command-line prompt: <code>pip install dvc</code>)</p> </li> <li> <p><strong><em>R</em></strong> <strong><em>3.4.x</em></strong> runtime environment for your OS</p> </li> <li> <p><strong><em>git</em></strong> command-line client application for your OS</p> </li> </ul> <h2 id="technical-challenges" style="position:relative;">Technical Challenges<a href="#technical-challenges" aria-label="technical challenges permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The technical challenge of building the ML pipeline for this project was to meet the following business requirements:</p> <ul> <li> <p>Ability to conditionally trigger execution of 3 different ML prediction models</p> </li> <li> <p>Ability to conditionally trigger model ensemble prediction based on predictions of those 3 individual models</p> </li> <li> <p>Ability to specify the weight of each individual model prediction in the ensemble</p> </li> <li> <p>Fast redeployment and re-run of the ML pipeline upon frequent reconfiguration and model tweaks</p> </li> <li> <p>Reproducibility of the pipeline and forecasting results across multiple machines and team members</p> </li> </ul> <p>The following sections explain how these challenges are addressed in the design of the ML pipeline for this project.</p> <h2 id="ml-pipeline" style="position:relative;">ML Pipeline<a 
href="#ml-pipeline" aria-label="ml pipeline permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The ML pipeline for this project is presented in the diagram below</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 365.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/9cf20fd774b97331a5c6e17a1e92115b/39600/ml-pipeline.png" alt="ml pipeline" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>As you can see, the essential implementation of the solution is as follows</p> <ul> <li> <p><a href="https://gist.github.com/gvyshnya/443424775b0150baac774cc6cf3cb1cc" target="_blank" rel="nofollow noopener noreferrer"><code>preprocessing.R</code></a> handles all aspects of data manipulations and pre-processing (reading training and testing data sets, removing outliers, imputing NAs etc.) 
and stores the refined training and testing data as new files for reuse by the model scripts</p> </li> <li> <p>Three model scripts implement the training and forecasting algorithms for each of the models selected for this project (<a href="https://gist.github.com/gvyshnya/7ec76316c24bc1b4f595ef1256f52d3a" target="_blank" rel="nofollow noopener noreferrer"><code>LR.R</code></a>, <a href="https://gist.github.com/gvyshnya/50e5ea3efa9771d2e7cc121c2f1a04e4" target="_blank" rel="nofollow noopener noreferrer"><code>GBM.R</code></a>, <a href="https://gist.github.com/gvyshnya/2e5799863f02fec652c194020da82dd3" target="_blank" rel="nofollow noopener noreferrer"><code>xgboost.R</code></a>)</p> </li> <li> <p><a href="https://gist.github.com/gvyshnya/84379d6a68fd085fe3a26aabad453e55" target="_blank" rel="nofollow noopener noreferrer"><code>ensemble.R</code></a> is responsible for the weighted ensemble prediction and the final output of the Kaggle submission file</p> </li> <li> <p><code>config.R</code> is responsible for all of the conditional logic switches needed in the pipeline (to do this, it is sourced by all of the modeling and ensemble prediction scripts)</p> </li> </ul> <p>A special note on the lack of feature engineering in this project: it was an intentional decision based on the specifics of the dataset. The existing features were quite instrumental in predicting the target values ‘as is’. 
Therefore, it was decided to follow the well-known <a href="https://en.wikipedia.org/wiki/Pareto_principle" target="_blank" rel="nofollow noopener noreferrer">Pareto principle</a> (interpreted as “<strong><em>20% of efforts address 80% of issues</em></strong>”, in this case) and not to spend more time on it.</p> <p><strong><em>Note</em></strong>: all <code>R</code> and batch files mentioned throughout this blog post are available online in a separate GitHub <a href="https://github.com/gvyshnya/DVC_R_Ensemble" target="_blank" rel="nofollow noopener noreferrer">repository</a>. You will also be able to review more details on the implementation of each of the machine learning prediction models there.</p> <h3 id="pipeline-configuration-management" style="position:relative;">Pipeline Configuration Management<a href="#pipeline-configuration-management" aria-label="pipeline configuration management permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>All of the essential tweaks to the conditional machine learning pipeline for this project are managed by a configuration file. For ease of use across the solution, it is implemented as an R code module (<code>config.R</code>) that is sourced by all model training and forecasting scripts. 
Thus the respective parameters (assigned as R variables) will be retrieved by the runnable scripts, and the conditional logic there will be triggered respectively.</p> <p>This file is not intended to run from a command line (unlike the rest of the R scripts in the project).</p> <p></p><div id="gist73938264" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-config-r" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="config.R content, created by gvyshnya on 03:27PM on August 06, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="config.R"> <tbody><tr> <td id="file-config-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-config-r-LC1" class="blob-code blob-code-inner js-file-line"># Competition: https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/</td> </tr> <tr> <td id="file-config-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-config-r-LC2" class="blob-code blob-code-inner js-file-line"># This is a configuration file to the entire solution </td> </tr> <tr> <td id="file-config-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-config-r-LC3" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-config-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-config-r-LC4" class="blob-code blob-code-inner 
js-file-line"># LR.R specific settings</td> </tr> <tr> <td id="file-config-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-config-r-LC5" class="blob-code blob-code-inner js-file-line">cfg_run_LR <- 1 # if set to 0, LR model will not fit, and its prediction will not be calculated in the batch mode</td> </tr> <tr> <td id="file-config-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-config-r-LC6" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-config-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-config-r-LC7" class="blob-code blob-code-inner js-file-line"># GBM.R specific settings</td> </tr> <tr> <td id="file-config-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-config-r-LC8" class="blob-code blob-code-inner js-file-line">cfg_run_GBM <- 1 # if set to 0, GBM model will not fit, and its prediction will not be calculated in the batch mode</td> </tr> <tr> <td id="file-config-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-config-r-LC9" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-config-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-config-r-LC10" class="blob-code blob-code-inner js-file-line"># xgboost.R specific settings</td> </tr> <tr> <td id="file-config-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-config-r-LC11" class="blob-code blob-code-inner js-file-line">cfg_run_xgboost <- 1 # if set to 0, xgboost model will not fit, and its prediction will not be calculated in the batch mode</td> </tr> <tr> <td id="file-config-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-config-r-LC12" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td 
id="file-config-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td id="file-config-r-LC13" class="blob-code blob-code-inner js-file-line"># ensemble.R specific settings</td> </tr> <tr> <td id="file-config-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-config-r-LC14" class="blob-code blob-code-inner js-file-line">cfg_run_ensemble <- 1 # if set to 0, the ensemble will not predict, and ensemble prediction will not be created</td> </tr> <tr> <td id="file-config-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-config-r-LC15" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-config-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-config-r-LC16" class="blob-code blob-code-inner js-file-line"># ensemble components</td> </tr> <tr> <td id="file-config-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-config-r-LC17" class="blob-code blob-code-inner js-file-line">cfg_model_predictions <- c("data/submission_LR.csv", "data/submission_GBM.csv", "data/submission_XGBOOST.csv")</td> </tr> <tr> <td id="file-config-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-config-r-LC18" class="blob-code blob-code-inner js-file-line"># element weights mapped to the cfg_model_predictions elements above</td> </tr> <tr> <td id="file-config-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-config-r-LC19" class="blob-code blob-code-inner js-file-line">cfg_model_weights <- c(1,1,1) # weights of predictions of the models in the ensemble</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/gvyshnya/918e94b06ebf222f6bb56ed26a5f44ee/raw/e274919657607fdfd67a2fb6354e40ff0c4173e9/config.R" style="float:right" 
class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/gvyshnya/918e94b06ebf222f6bb56ed26a5f44ee#file-config-r" class="Link--inTextBlock"> config.R </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <h3 id="why-do-we-need-dvc" style="position:relative;">Why Do We Need DVC?<a href="#why-do-we-need-dvc" aria-label="why do we need dvc permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>As we all know, it is practically impossible to build an ideal ML model with sound prediction accuracy on the first try. You have to continuously adjust your algorithm and model implementations based on cross-validation results until the results are satisfactory. This is especially true in ensemble learning, where you constantly have to tweak not only the parameters of the individual prediction models but also the settings of the ensemble itself:</p> <ul> <li> <p>changing the ensemble composition by adding or removing individual prediction models</p> </li> <li> <p>changing the weights of the model predictions in the resulting ensemble prediction</p> </li> </ul> <p>Under such conditions, DVC helps you manage your ensemble ML pipeline in a reliable manner. 
Let’s consider the following real-world scenario:</p> <ul> <li> <p>Your team member changes the settings of the <code>GBM</code> model and resubmits its implementation to the repository (this is emulated by the commit <a href="https://github.com/gvyshnya/DVC_R_Ensemble/commit/27825d0732f72f07e7e4e48548ddb8a8604103f0" target="_blank" rel="nofollow noopener noreferrer">#8604103f0</a>, checksum <code>27825d0</code>)</p> </li> <li> <p>You rerun the entire ML pipeline on your computer to get the newest predictions from <code>GBM</code> as well as the updated final ensemble prediction</p> </li> <li> <p>The prediction results are still not optimal, so someone changes the weights of the individual models in the ensemble, assigning <code>GBM</code> a higher weight than <code>xgboost</code> and <code>LR</code></p> </li> <li> <p>After the ensemble setup changes are committed (and the updated <code>config.R</code> appears in the repository, as emulated by the commit <a href="https://github.com/gvyshnya/DVC_R_Ensemble/commit/5bcbe115afcb24886abb4734ff2da42eb97612ce" target="_blank" rel="nofollow noopener noreferrer">#eb97612ce</a>, checksum <code>5bcbe11</code>), you re-run the model predictions and the final ensemble prediction on your machine once again</p> </li> </ul> <p>All you need to do to handle the changes above is to keep running your <strong>DVC</strong> commands per the script developed (see the section below). You do not have to remember or explicitly know which changes were made to the project codebase or its pipeline configuration. 
<strong>DVC</strong> will automatically check out latest changes from the repo as well as make sure it runs only those steps in the pipeline that were affected by the recent changes in the code modules.</p> <h3 id="orchestrating-the-pipeline--dvc-command-file" style="position:relative;">Orchestrating the Pipeline : DVC Command File<a href="#orchestrating-the-pipeline--dvc-command-file" aria-label="orchestrating the pipeline dvc command file permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h3> <p>After we developed individual R scripts needed by different steps of our Machine Learning pipeline, we orchestrate it together using DVC.</p> <p>Below is a batch file illustrating how DVC manages steps of the machine learning process for this project</p> <p></p><div id="gist73940214" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-dvc-bat" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-batchfile" style="overflow: auto" tabindex="0" role="region" aria-label="dvc.bat content, created by gvyshnya on 04:05PM on August 06, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> 
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="dvc.bat"> <tbody><tr> <td id="file-dvc-bat-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-dvc-bat-LC1" class="blob-code blob-code-inner js-file-line"># This is a DVC-based script to manage machine-learning pipeline for a project per</td> </tr> <tr> <td id="file-dvc-bat-L2" class="blob-num 
js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-dvc-bat-LC2" class="blob-code blob-code-inner js-file-line"># https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/</td> </tr> <tr> <td id="file-dvc-bat-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-dvc-bat-LC3" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-dvc-bat-LC4" class="blob-code blob-code-inner js-file-line">mkdir R_DVC_GITHUB_CODE</td> </tr> <tr> <td id="file-dvc-bat-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-dvc-bat-LC5" class="blob-code blob-code-inner js-file-line">cd R_DVC_GITHUB_CODE</td> </tr> <tr> <td id="file-dvc-bat-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-dvc-bat-LC6" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-dvc-bat-LC7" class="blob-code blob-code-inner js-file-line"># clone the github repo with the code</td> </tr> <tr> <td id="file-dvc-bat-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-dvc-bat-LC8" class="blob-code blob-code-inner js-file-line">git clone https://github.com/gvyshnya/DVC_R_Ensemble</td> </tr> <tr> <td id="file-dvc-bat-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-dvc-bat-LC9" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-dvc-bat-LC10" class="blob-code blob-code-inner js-file-line"># initialize DVC</td> </tr> <tr> <td id="file-dvc-bat-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-dvc-bat-LC11" class="blob-code 
blob-code-inner js-file-line">$ dvc init</td> </tr> <tr> <td id="file-dvc-bat-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-dvc-bat-LC12" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td id="file-dvc-bat-LC13" class="blob-code blob-code-inner js-file-line"># import data</td> </tr> <tr> <td id="file-dvc-bat-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-dvc-bat-LC14" class="blob-code blob-code-inner js-file-line">$ dvc import https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/download/wine.csv data/</td> </tr> <tr> <td id="file-dvc-bat-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-dvc-bat-LC15" class="blob-code blob-code-inner js-file-line">$ dvc import https://inclass.kaggle.com/c/pred-411-2016-04-u3-wine/download/wine_test.csv data/</td> </tr> <tr> <td id="file-dvc-bat-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-dvc-bat-LC16" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-dvc-bat-LC17" class="blob-code blob-code-inner js-file-line"># run data pre-processing</td> </tr> <tr> <td id="file-dvc-bat-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-dvc-bat-LC18" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/preprocessing.R data/wine.csv data/wine_test.csv data/training_imputed.csv data/testing_imputed.csv</td> </tr> <tr> <td id="file-dvc-bat-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-dvc-bat-LC19" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L20" class="blob-num js-line-number 
js-blob-rnum" data-line-number="20"></td> <td id="file-dvc-bat-LC20" class="blob-code blob-code-inner js-file-line"># run LR model fit and forecasting</td> </tr> <tr> <td id="file-dvc-bat-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td> <td id="file-dvc-bat-LC21" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/LR.R data/training_imputed.csv data/testing_imputed.csv 0.7 825 data/submission_LR.csv code/config.R</td> </tr> <tr> <td id="file-dvc-bat-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td> <td id="file-dvc-bat-LC22" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-dvc-bat-LC23" class="blob-code blob-code-inner js-file-line"># run GBM model fit and forecasting</td> </tr> <tr> <td id="file-dvc-bat-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td> <td id="file-dvc-bat-LC24" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/GBM.R data/training_imputed.csv data/testing_imputed.csv 5000 10 4 25 data/submission_GBM.csv code/config.R</td> </tr> <tr> <td id="file-dvc-bat-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td> <td id="file-dvc-bat-LC25" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td> <td id="file-dvc-bat-LC26" class="blob-code blob-code-inner js-file-line"># run XGBOOST model fit and forecasting</td> </tr> <tr> <td id="file-dvc-bat-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td> <td id="file-dvc-bat-LC27" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/GBM.R data/training_imputed.csv data/testing_imputed.csv 1000 10 0.0001 1.0 data/submission_xgboost.csv code/config.R</td> </tr> <tr> <td 
id="file-dvc-bat-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td> <td id="file-dvc-bat-LC28" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-bat-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td> <td id="file-dvc-bat-LC29" class="blob-code blob-code-inner js-file-line"># prepare ensemble submission</td> </tr> <tr> <td id="file-dvc-bat-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td> <td id="file-dvc-bat-LC30" class="blob-code blob-code-inner js-file-line"># Note: please make sure to edit your code/config.R to set up the references to the predictions from each model according</td> </tr> <tr> <td id="file-dvc-bat-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td> <td id="file-dvc-bat-LC31" class="blob-code blob-code-inner js-file-line"># to the names of output files on the steps above</td> </tr> <tr> <td id="file-dvc-bat-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td> <td id="file-dvc-bat-LC32" class="blob-code blob-code-inner js-file-line">$ dvc run Rscript --vanilla code/ensemble.R data/submission_ensemble.csv code/config.R</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/gvyshnya/7f1b8262e3eb7a8b3c16dbfd8cf98644/raw/4818eab6c2f99722110a37c7d2c509c78ce4240a/dvc.bat" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/gvyshnya/7f1b8262e3eb7a8b3c16dbfd8cf98644#file-dvc-bat" class="Link--inTextBlock"> dvc.bat </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p>If you later edit the ensemble configuration in <code>code/config.R</code>, you can rely on DVC's automatic dependency resolution and tracking to rebuild the ensemble prediction as follows:</p> <p></p><div id="gist74997297" 
class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-dvc-repro-code" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-text" style="overflow: auto" tabindex="0" role="region" aria-label="dvc repro code content, created by gvyshnya on 07:22PM on August 20, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> 
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="dvc repro code"> <tbody><tr> <td id="file-dvc-repro-code-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-dvc-repro-code-LC1" class="blob-code blob-code-inner js-file-line"># Improve ensemble configuration</td> </tr> <tr> <td id="file-dvc-repro-code-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-dvc-repro-code-LC2" class="blob-code blob-code-inner js-file-line">$ vi code/config.R</td> </tr> <tr> <td id="file-dvc-repro-code-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-dvc-repro-code-LC3" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-repro-code-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-dvc-repro-code-LC4" class="blob-code blob-code-inner js-file-line"># 
Commit all the changes.</td> </tr> <tr> <td id="file-dvc-repro-code-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-dvc-repro-code-LC5" class="blob-code blob-code-inner js-file-line">$ git commit -am "Updated weights of the models in the ensemble"</td> </tr> <tr> <td id="file-dvc-repro-code-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-dvc-repro-code-LC6" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc-repro-code-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-dvc-repro-code-LC7" class="blob-code blob-code-inner js-file-line"># Reproduce the ensemble prediction</td> </tr> <tr> <td id="file-dvc-repro-code-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-dvc-repro-code-LC8" class="blob-code blob-code-inner js-file-line">$ dvc repro data/submission_ensemble.csv</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/gvyshnya/9d80e51ba3d7aa5bd37d100ed82376ee/raw/4367adacf7f6d78ad223289c52737588441fabcb/dvc%20repro%20code" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/gvyshnya/9d80e51ba3d7aa5bd37d100ed82376ee#file-dvc-repro-code" class="Link--inTextBlock"> dvc repro code </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 
9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In this blog post, we worked through the process of building an ensemble prediction pipeline using DVC. The key features of that pipeline were as follows:</p> <ul> <li> <p><strong><em>reproducibility</em></strong>: everybody on the team can run it on their own machine</p> </li> <li> <p><strong><em>separation of data and code</em></strong>: this ensures everyone always runs the latest versions of the pipeline jobs with the most up-to-date ‘golden copy’ of the training and testing data sets</p> </li> </ul> <p>A helpful side effect of using DVC is that you no longer have to keep in mind what was changed at every step of modifying your project scripts or the pipeline configuration. Because DVC maintains the dependency graph (DAG) automatically, it triggers only the steps that were affected by a particular change within the pipeline job setup. This, in turn, makes it possible to iterate quickly through the entire ML pipeline.</p> <blockquote> <p>As DVC brings proven engineering practices to often suboptimal and messy ML processes, and helps a typical Data Science project team eliminate a big chunk of common <a href="https://blog.dataversioncontrol.com/data-version-control-in-analytics-devops-paradigm-35a880e99133" target="_blank" rel="nofollow noopener noreferrer">DevOps overheads</a>, I found it extremely useful to leverage DVC on industrial data science and predictive analytics projects.</p> </blockquote> <h2 id="further-reading" style="position:relative;">Further Reading<a href="#further-reading" aria-label="further reading permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 
1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <ol> <li> <p><a href="https://en.wikipedia.org/wiki/Ensemble_learning" target="_blank" rel="nofollow noopener noreferrer">Ensemble Learning and Prediction Introduction</a></p> </li> <li> <p><a href="https://blog.dataversioncontrol.com/data-version-control-beta-release-iterative-machine-learning-a7faf7c8be67" target="_blank" rel="nofollow noopener noreferrer">Using DVC in Machine Learning projects in Python</a></p> </li> <li> <p><a href="https://blog.dataversioncontrol.com/r-code-and-reproducible-model-development-with-dvc-1507a0e3687b" target="_blank" rel="nofollow noopener noreferrer">Using DVC in Machine Learning projects in R</a></p> </li> <li> <p><a href="https://mlwave.com/kaggle-ensembling-guide/" target="_blank" rel="nofollow noopener noreferrer">Kaggle Ensembling Guide</a></p> </li> </ol>https://dvc.org/blog/data-version-control-in-analytics-devops-paradigmhttps://dvc.org/blog/data-version-control-in-analytics-devops-paradigmThu, 27 Jul 2017 00:00:00 GMT<h2 id="data-science-and-devops-convergence" style="position:relative;">Data Science and DevOps Convergence<a href="#data-science-and-devops-convergence" aria-label="data science and devops convergence permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The primary mission of DevOps is to help teams resolve various Tech Ops infrastructure, tooling, and pipeline issues.</p> <p>On the other hand, as mentioned in the conceptual review by <a 
href="https://www.forbes.com/sites/teradata/2016/11/14/devops-for-data-science-why-analytics-ops-is-key-to-value/" target="_blank" rel="nofollow noopener noreferrer">Forbes</a> in November 2016, industrial analytics is no longer going to be driven by data scientists alone. It requires an investment in DevOps skills, practices, and supporting technology to move analytics out of the lab and into the business. There are even <a href="https://www.computing.co.uk/ctg/news/2433095/a-lot-of-companies-will-stop-hiring-data-scientists-when-they-realise-that-the-majority-bring-no-value-says-data-scientist" target="_blank" rel="nofollow noopener noreferrer">voices</a> calling on Data Scientists to concentrate on agile methodology and DevOps if they want to retain their jobs in business in the long run.</p> <h2 id="why-devops-matters" style="position:relative;">Why DevOps Matters<a href="#why-devops-matters" aria-label="why devops matters permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The eternal dream of almost every Data Scientist today is to spend all (well, almost all) of their time in the office exploring new datasets, engineering decisive new features, and inventing and validating cool new algorithms and strategies. However, reality is often different. One of the unfortunate daily routines of a Data Scientist's work is raw data pre-processing. 
It usually translates to the challenges to</p> <ol> <li> <p><strong>Pull all kinds of necessary data from a variety of sources</strong></p> <ul> <li> <p>Internal data sources like ERP, CRM, POS systems, or data from online e-commerce platforms</p> </li> <li> <p>External data, like weather, public holidays, Google trends etc.</p> </li> </ul> </li> <li> <p><strong>Extract, transform, and load the data</strong></p> <ul> <li> <p>Relate and join the data sources</p> </li> <li> <p>Aggregate and transform the data</p> </li> </ul> </li> <li> <p><strong>Avoid technical and performance drawbacks</strong> when everything ends up in “one big table” at the end</p> </li> <li> <p><strong>Facilitate continuous machine learning and decision-making in a business-ready framework</strong></p> <ul> <li> <p>Utilize historic data to train the machine learning models and algorithms</p> </li> <li> <p>Use the current, up-to-date data for decision-making</p> </li> <li> <p>Export back the resulting decisions/recommendations to review by business stakeholders, either back into the ERP system or some other data warehouse</p> </li> </ul> </li> </ol> <p>Another big challenge is to organize <strong>collaboration and data/model sharing</strong> inside and across the boundaries of teams of Data Scientists and Software Engineers.</p> <p>DevOps skills as well as effective instruments will certainly be beneficial for industrial Data Scientists as they can address the above-mentioned challenges in a self-service manner.</p> <h2 id="can-dvc-be-a-solution" style="position:relative;">Can DVC Be a Solution?<a href="#can-dvc-be-a-solution" aria-label="can dvc be a solution permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 
0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">Data Version Control</a>, or simply DVC, comes onto the scene whenever you start looking for effective DevOps-for-Analytics instruments.</p> <p>DVC is an open source tool for data science projects. It makes your data science projects reproducible by automatically building a data dependency graph (DAG). Your code and its dependencies can be easily shared through Git, and your data through cloud storage (AWS S3, GCP), in a single DVC environment.</p> <blockquote> <p>Although DVC was <a href="https://dvc.org/doc/understanding-dvc/what-is-dvc" target="_blank" rel="nofollow noopener noreferrer">originally</a> created for machine learning developers and data scientists, it has proved useful beyond that audience. Since it brings proven engineering practices to a poorly defined ML process, I found it to have enormous potential as an Analytical DevOps instrument.</p> </blockquote> <p>It clearly helps to manage a big fraction of the DevOps issues in a Data Scientist's daily routine:</p> <ol> <li> <p><strong>Pull all kinds of necessary data from a variety of sources</strong>. Once you configure and script your data extraction jobs with DVC, they will be persistent and operable across your data and service infrastructure</p> </li> <li> <p><strong>Extract, transform, and load the data</strong>. ETL is going to be easy and repeatable once you configure it with DVC scripting. It will become a solid pipeline that operates without major support effort. 
Moreover, it will track all changes and, via the DAG, identify which pipeline steps need to be updated.</p> </li> <li> <p><strong>Facilitate continuous machine learning and decision-making.</strong> The part of the pipeline facilitated through DVC scripting can include jobs that upload data back to any transactional system (like ERP, ERM, CRM etc.), warehouse, or data mart. The data will then be exposed to business stakeholders so they can make intelligent data-driven decisions.</p> </li> <li> <p><strong>Share your algorithms and data</strong>. Machine Learning modeling is an iterative process, and it is extremely important to keep track of your steps, the dependencies between the steps, the dependencies between your code and data files, and all the arguments your code runs with. This becomes even more important and complicated in a team environment where data scientists’ collaboration takes a serious amount of the team’s effort. DVC is the tool that helps you with this.</p> </li> </ol> <p>One of the ‘juicy’ features of DVC is its ability to support multiple technology stacks. Whether you prefer R or use promising Python-based implementations for your industrial data products, DVC will be able to support your pipeline properly. 
You can see it in action for both <a href="https://blog.dvc.org/how-data-scientists-can-improve-their-productivity" target="_blank" rel="nofollow noopener noreferrer">Python-based</a> and <a href="https://blog.dvc.org/r-code-and-reproducible-model-development-with-dvc" target="_blank" rel="nofollow noopener noreferrer">R-based</a> technical stacks.</p> <p>As such, DVC is going to be one of the tools you will enjoy using if and when you embark on building a continual analytical environment for your system or across your organization.</p> <h2 id="continual-analytical-environment-and-devops" style="position:relative;">Continual Analytical Environment and DevOps<a href="#continual-analytical-environment-and-devops" aria-label="continual analytical environment and devops permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Building a production pipeline is quite different from building a machine-learning prototype on a local laptop. 
Many teams and companies face challenges there.</p> <p>At a bare minimum, the following requirements must be met when you move your solution into production:</p> <ol> <li> <p>Periodic re-training of the models/algorithms</p> </li> <li> <p>Ease of re-deployment and configuration changes in the system</p> </li> <li> <p>Efficient, high-performance real-time scoring of new out-of-sample observations</p> </li> <li> <p>Ability to monitor model performance over time</p> </li> <li> <p>Adaptive ETL and ability to manage new data feeds and transactional systems as data sources for AI and machine learning tools</p> </li> <li> <p>Scaling to really big data operations</p> </li> <li> <p>Security and authorized access levels for different areas of the analytical systems</p> </li> <li> <p>Solid backup and recovery processes/tools</p> </li> </ol> <p>This goes into the territory traditionally inhabited by DevOps. Data Scientists should ideally learn to handle some of those requirements themselves, or at least be well-informed consultants to classical DevOps gurus.</p> <p>DVC can help in many aspects of the production scenario above, as it can orchestrate relevant tools and instruments through its scripting. 
In such a setup, DVC scripts will be a sharable manifestation (and implementation) of your production pipeline, where each step can be transparently reviewed, easily maintained, and changed as needed over time.</p> <h2 id="will-devops-be-captivating" style="position:relative;">Will DevOps Be Captivating?<a href="#will-devops-be-captivating" aria-label="will devops be captivating permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>If you are interested in understanding the ever-growing role of DevOps in modern Data Science and predictive analytics in business, the resources below are good starting points:</p> <ol> <li> <p><a href="https://www.forbes.com/sites/teradata/2016/11/14/devops-for-data-science-why-analytics-ops-is-key-to-value/" target="_blank" rel="nofollow noopener noreferrer">DevOps For Data Science: Why Analytics Ops Is Key To Value</a> (Forbes, Nov 14, 2016)</p> </li> <li> <p><a href="https://www.packtpub.com/books/content/bridging-gap-between-data-science-and-devops" target="_blank" rel="nofollow noopener noreferrer">Bridging the Gap Between Data Science and DevOps</a></p> </li> <li> <p><a href="https://devops.com/devops-life-better-data-scientists/" target="_blank" rel="nofollow noopener noreferrer">Is DevOps Making Life Better for Data Scientists?</a></p> </li> </ol> <p>In any case, DVC is going to be a useful instrument for filling the multiple gaps between classical, in-lab, old-school data science practices and the growing demands of business to build solid DevOps processes and workflows that streamline mature and persistent 
data analytics.</p>https://dvc.org/blog/r-code-and-reproducible-model-development-with-dvchttps://dvc.org/blog/r-code-and-reproducible-model-development-with-dvcMon, 24 Jul 2017 00:00:00 GMT<p><a href="https://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC</a>, or Data Version Control, is a tool that tracks file and data dependencies during model development in order to facilitate reproducibility and to version data files. Most of the <a href="https://dvc.org/doc/tutorials" target="_blank" rel="nofollow noopener noreferrer">DVC tutorials</a> provide good examples of using DVC with Python. However, I realized that DVC is a <a href="https://en.wikipedia.org/wiki/Language-agnostic" target="_blank" rel="nofollow noopener noreferrer">language-agnostic</a> tool and can be used with any programming language. In this blog post, we will see how to use DVC in R projects.</p> <h2 id="r-coding--keep-it-simple-and-readable" style="position:relative;">R coding — keep it simple and readable<a href="#r-coding--keep-it-simple-and-readable" aria-label="r coding keep it simple and readable permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>Model development is always a combination of the steps presented below:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 342px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/fdf37f71d0c9ecd4d9f1b7f0ec446abf/921db/development-steps.png" alt="Model development process" title="" loading="auto" 
decoding="async" style="max-width: 100%; margin: auto;"></span> <em>Model development process</em></p> <p>Because the process is iterative, it is very important to improve some coding and organizational skills. For example, instead of having one big R file with code, it is better to split the code into several logical files, each responsible for one small piece of work. It is smart to track the development history with <a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control" target="_blank" rel="nofollow noopener noreferrer">git</a>. Writing “<em>reusable code</em>” is a nice skill to have. Putting comments in the code can make our lives easier.</p> <p>Besides git, the next step in further improvement is to try out DVC. Every time a change is made (and committed) to some of the code or data sets, DVC can reproduce the results with just one bash command on Linux (or in a Windows environment). It remembers the dependencies among files and code, so it can repeat all the necessary steps in the right order instead of us worrying about it.</p> <h2 id="r-example--data-and-code-clarification" style="position:relative;">R example — data and code clarification<a href="#r-example--data-and-code-clarification" aria-label="r example data and code clarification permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>We’ll take a Python example from the <a href="https://dvc.org/doc/tutorials/deep" target="_blank" rel="nofollow noopener noreferrer">DVC tutorial</a> (written by Dmitry Petrov) and rewrite that code in R. 
With an example, we’ll show how DVC can help during development and what its possibilities are.</p> <p>First, let’s initialize git and DVC for this example and run our code for the first time. After that, we will simulate some changes in the code and see how DVC handles reproducibility.</p> <p>The R code can be downloaded from the <a href="https://github.com/Zoldin/R_AND_DVC" target="_blank" rel="nofollow noopener noreferrer">GitHub repository</a>. A brief explanation of each script is presented below:</p> <p><strong>parsingxml.R</strong>: takes the XML file that we downloaded from the web and creates the corresponding CSV file.</p> <p></p><div id="gist71114089" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-parsingxml-r" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="parsingxml.R content, created by Zoldin on 08:40PM on July 21, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="parsingxml.R"> <tbody><tr> <td id="file-parsingxml-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-parsingxml-r-LC1" class="blob-code blob-code-inner js-file-line">#!/usr/bin/Rscript</td> </tr> <tr> <td id="file-parsingxml-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-parsingxml-r-LC2" class="blob-code blob-code-inner js-file-line">library(XML)</td> </tr> <tr> <td id="file-parsingxml-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-parsingxml-r-LC3" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-parsingxml-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-parsingxml-r-LC4" class="blob-code blob-code-inner js-file-line">args = commandArgs(trailingOnly=TRUE)</td> 
</tr> <tr> <td id="file-parsingxml-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-parsingxml-r-LC5" class="blob-code blob-code-inner js-file-line">if (!length(args)==2) {</td> </tr> <tr> <td id="file-parsingxml-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-parsingxml-r-LC6" class="blob-code blob-code-inner js-file-line"> stop("Two arguments must be supplied (input file name, output file name - csv ext).\n", call.=FALSE)</td> </tr> <tr> <td id="file-parsingxml-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-parsingxml-r-LC7" class="blob-code blob-code-inner js-file-line">} </td> </tr> <tr> <td id="file-parsingxml-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-parsingxml-r-LC8" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-parsingxml-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-parsingxml-r-LC9" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-parsingxml-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-parsingxml-r-LC10" class="blob-code blob-code-inner js-file-line">#read XML line by line</td> </tr> <tr> <td id="file-parsingxml-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-parsingxml-r-LC11" class="blob-code blob-code-inner js-file-line">con <- file(args[1], "r")</td> </tr> <tr> <td id="file-parsingxml-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-parsingxml-r-LC12" class="blob-code blob-code-inner js-file-line">lines <- readLines(con, -1)</td> </tr> <tr> <td id="file-parsingxml-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td id="file-parsingxml-r-LC13" class="blob-code blob-code-inner js-file-line">test <- 
lapply(lines,function(x){return(xmlTreeParse(x,useInternalNodes = TRUE))})</td> </tr> <tr> <td id="file-parsingxml-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-parsingxml-r-LC14" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-parsingxml-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-parsingxml-r-LC15" class="blob-code blob-code-inner js-file-line">#parsing XML to get variables</td> </tr> <tr> <td id="file-parsingxml-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-parsingxml-r-LC16" class="blob-code blob-code-inner js-file-line">ID <- as.numeric(sapply(test,function(x){return(xpathSApply(x, "//row",xmlGetAttr, "Id"))}))</td> </tr> <tr> <td id="file-parsingxml-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-parsingxml-r-LC17" class="blob-code blob-code-inner js-file-line">Tags <- sapply(test,function(x){return(xpathSApply(x, "//row",xmlGetAttr, "Tags"))})</td> </tr> <tr> <td id="file-parsingxml-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-parsingxml-r-LC18" class="blob-code blob-code-inner js-file-line">Title <- as.character(sapply(test,function(x){return(xpathSApply(x, "//row",xmlGetAttr, "Title"))}))</td> </tr> <tr> <td id="file-parsingxml-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-parsingxml-r-LC19" class="blob-code blob-code-inner js-file-line">Body <- as.character(sapply(test,function(x){return(xpathSApply(x, "//row",xmlGetAttr, "Body"))}))</td> </tr> <tr> <td id="file-parsingxml-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td> <td id="file-parsingxml-r-LC20" class="blob-code blob-code-inner js-file-line">text = paste(Title,Body)</td> </tr> <tr> <td id="file-parsingxml-r-L21" class="blob-num js-line-number js-blob-rnum" 
data-line-number="21"></td> <td id="file-parsingxml-r-LC21" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-parsingxml-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td> <td id="file-parsingxml-r-LC22" class="blob-code blob-code-inner js-file-line">label = as.numeric(sapply(Tags,function(x){return(grep("python",x))}))</td> </tr> <tr> <td id="file-parsingxml-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-parsingxml-r-LC23" class="blob-code blob-code-inner js-file-line">label[is.na(label)]=0</td> </tr> <tr> <td id="file-parsingxml-r-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td> <td id="file-parsingxml-r-LC24" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-parsingxml-r-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td> <td id="file-parsingxml-r-LC25" class="blob-code blob-code-inner js-file-line">#final data frame for export</td> </tr> <tr> <td id="file-parsingxml-r-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td> <td id="file-parsingxml-r-LC26" class="blob-code blob-code-inner js-file-line">df <- as.data.frame(cbind(ID,label,text),stringsAsFactors = FALSE)</td> </tr> <tr> <td id="file-parsingxml-r-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td> <td id="file-parsingxml-r-LC27" class="blob-code blob-code-inner js-file-line">df$ID=as.numeric(df$ID)</td> </tr> <tr> <td id="file-parsingxml-r-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td> <td id="file-parsingxml-r-LC28" class="blob-code blob-code-inner js-file-line">df$label=as.numeric(df$label)</td> </tr> <tr> <td id="file-parsingxml-r-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td> <td id="file-parsingxml-r-LC29" class="blob-code blob-code-inner js-file-line">#write to csv</td> </tr> <tr> <td 
id="file-parsingxml-r-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td> <td id="file-parsingxml-r-LC30" class="blob-code blob-code-inner js-file-line">write.csv(df, file=args[2],row.names=FALSE)</td> </tr> <tr> <td id="file-parsingxml-r-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td> <td id="file-parsingxml-r-LC31" class="blob-code blob-code-inner js-file-line">print("output file created....")</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/Zoldin/47536af63182a0e8daf37a7b989e2e8d/raw/98b259ade11132ad87e9c4f476b7561b184cf041/parsingxml.R" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/Zoldin/47536af63182a0e8daf37a7b989e2e8d#file-parsingxml-r" class="Link--inTextBlock"> parsingxml.R </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p><strong>train_test_splitting.R</strong>: stratified sampling by the target variable (here we create the train and test data sets).</p> <p></p><div id="gist71114469" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-train_test_splitting-r" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="train_test_splitting.R content, created by Zoldin on 08:42PM on July 21, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="train_test_splitting.R"> <tbody><tr> <td id="file-train_test_splitting-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-train_test_splitting-r-LC1" class="blob-code blob-code-inner js-file-line">#!/usr/bin/Rscript</td> </tr> <tr> <td id="file-train_test_splitting-r-L2" class="blob-num js-line-number js-blob-rnum" 
data-line-number="2"></td> <td id="file-train_test_splitting-r-LC2" class="blob-code blob-code-inner js-file-line">library(caret)</td> </tr> <tr> <td id="file-train_test_splitting-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-train_test_splitting-r-LC3" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-train_test_splitting-r-LC4" class="blob-code blob-code-inner js-file-line">args = commandArgs(trailingOnly=TRUE)</td> </tr> <tr> <td id="file-train_test_splitting-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-train_test_splitting-r-LC5" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-train_test_splitting-r-LC6" class="blob-code blob-code-inner js-file-line">if (!length(args)==5) {</td> </tr> <tr> <td id="file-train_test_splitting-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-train_test_splitting-r-LC7" class="blob-code blob-code-inner js-file-line"> stop("Five arguments must be supplied (input file name, splitting ratio related to test data set, seed, train output file name, test output file name).\n", call.=FALSE)</td> </tr> <tr> <td id="file-train_test_splitting-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-train_test_splitting-r-LC8" class="blob-code blob-code-inner js-file-line">} </td> </tr> <tr> <td id="file-train_test_splitting-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-train_test_splitting-r-LC9" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L10" class="blob-num js-line-number js-blob-rnum" 
data-line-number="10"></td> <td id="file-train_test_splitting-r-LC10" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-train_test_splitting-r-LC11" class="blob-code blob-code-inner js-file-line">set.seed(as.numeric(args[3]))</td> </tr> <tr> <td id="file-train_test_splitting-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-train_test_splitting-r-LC12" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td id="file-train_test_splitting-r-LC13" class="blob-code blob-code-inner js-file-line">df <- read.csv(args[1],stringsAsFactors = FALSE)</td> </tr> <tr> <td id="file-train_test_splitting-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-train_test_splitting-r-LC14" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-train_test_splitting-r-LC15" class="blob-code blob-code-inner js-file-line">test.index <- createDataPartition(df$label, p = as.numeric(args[2]), list = FALSE)</td> </tr> <tr> <td id="file-train_test_splitting-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-train_test_splitting-r-LC16" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-train_test_splitting-r-LC17" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-train_test_splitting-r-LC18" 
class="blob-code blob-code-inner js-file-line">train <- df[-test.index,]</td> </tr> <tr> <td id="file-train_test_splitting-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-train_test_splitting-r-LC19" class="blob-code blob-code-inner js-file-line">test <- df[test.index,]</td> </tr> <tr> <td id="file-train_test_splitting-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td> <td id="file-train_test_splitting-r-LC20" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td> <td id="file-train_test_splitting-r-LC21" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_test_splitting-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td> <td id="file-train_test_splitting-r-LC22" class="blob-code blob-code-inner js-file-line">write.csv(train, file=args[4],row.names=FALSE)</td> </tr> <tr> <td id="file-train_test_splitting-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-train_test_splitting-r-LC23" class="blob-code blob-code-inner js-file-line">write.csv(test, file=args[5],row.names=FALSE)</td> </tr> <tr> <td id="file-train_test_splitting-r-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td> <td id="file-train_test_splitting-r-LC24" class="blob-code blob-code-inner js-file-line">print("train/test files created....")</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/Zoldin/7591c47ce5988cbe087e0038c9a850b9/raw/e2106c39bad8a4ae04e41658bd287ea94ff7437a/train_test_splitting.R" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/Zoldin/7591c47ce5988cbe087e0038c9a850b9#file-train_test_splitting-r" class="Link--inTextBlock"> train_test_splitting.R </a> hosted 
with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p><strong>featurization.R</strong>: text mining and tf-idf matrix creation. In this part we create the predictor variables.</p> <p></p><div id="gist71113907" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-featurization-r" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="featurization.R content, created by Zoldin on 08:39PM on July 21, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="featurization.R"> <tbody><tr> <td id="file-featurization-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-featurization-r-LC1" class="blob-code blob-code-inner js-file-line">#!/usr/bin/Rscript</td> </tr> <tr> <td id="file-featurization-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-featurization-r-LC2" class="blob-code blob-code-inner js-file-line">library(text2vec)</td> </tr> <tr> <td id="file-featurization-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-featurization-r-LC3" class="blob-code blob-code-inner js-file-line">library(MASS)</td> </tr> <tr> <td id="file-featurization-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-featurization-r-LC4" class="blob-code blob-code-inner 
js-file-line">library(Matrix)</td> </tr> <tr> <td id="file-featurization-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-featurization-r-LC5" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-featurization-r-LC6" class="blob-code blob-code-inner js-file-line">args = commandArgs(trailingOnly=TRUE)</td> </tr> <tr> <td id="file-featurization-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-featurization-r-LC7" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-featurization-r-LC8" class="blob-code blob-code-inner js-file-line">if (!length(args)==4) {</td> </tr> <tr> <td id="file-featurization-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-featurization-r-LC9" class="blob-code blob-code-inner js-file-line"> stop("Four arguments must be supplied (train file (csv format), test data set (csv format), train output file name and test output file name - txt files).\n", call.=FALSE)</td> </tr> <tr> <td id="file-featurization-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-featurization-r-LC10" class="blob-code blob-code-inner js-file-line">} </td> </tr> <tr> <td id="file-featurization-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-featurization-r-LC11" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-featurization-r-LC12" class="blob-code blob-code-inner js-file-line">#read input files</td> </tr> <tr> <td id="file-featurization-r-L13" class="blob-num js-line-number 
js-blob-rnum" data-line-number="13"></td> <td id="file-featurization-r-LC13" class="blob-code blob-code-inner js-file-line">df_train = read.csv(args[1],stringsAsFactors = FALSE)</td> </tr> <tr> <td id="file-featurization-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-featurization-r-LC14" class="blob-code blob-code-inner js-file-line">df_test = read.csv(args[2],stringsAsFactors = FALSE)</td> </tr> <tr> <td id="file-featurization-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-featurization-r-LC15" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-featurization-r-LC16" class="blob-code blob-code-inner js-file-line">#create vocabulary - words</td> </tr> <tr> <td id="file-featurization-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-featurization-r-LC17" class="blob-code blob-code-inner js-file-line">prep_fun = tolower</td> </tr> <tr> <td id="file-featurization-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-featurization-r-LC18" class="blob-code blob-code-inner js-file-line">tok_fun = word_tokenizer</td> </tr> <tr> <td id="file-featurization-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-featurization-r-LC19" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td> <td id="file-featurization-r-LC20" class="blob-code blob-code-inner js-file-line">it_train = itoken(df_train$text, preprocessor = prep_fun, tokenizer = tok_fun, ids = df_train$ID, progressbar = FALSE)</td> </tr> <tr> <td id="file-featurization-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td> <td 
id="file-featurization-r-LC21" class="blob-code blob-code-inner js-file-line">vocab = create_vocabulary(it_train,stopwords = stop_words)</td> </tr> <tr> <td id="file-featurization-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td> <td id="file-featurization-r-LC22" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-featurization-r-LC23" class="blob-code blob-code-inner js-file-line">#clean vocabulary - use only 5000 terms</td> </tr> <tr> <td id="file-featurization-r-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td> <td id="file-featurization-r-LC24" class="blob-code blob-code-inner js-file-line">pruned_vocab <- prune_vocabulary(vocab, max_number_of_terms=5000)</td> </tr> <tr> <td id="file-featurization-r-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td> <td id="file-featurization-r-LC25" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td> <td id="file-featurization-r-LC26" class="blob-code blob-code-inner js-file-line">vectorizer = vocab_vectorizer(pruned_vocab)</td> </tr> <tr> <td id="file-featurization-r-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td> <td id="file-featurization-r-LC27" class="blob-code blob-code-inner js-file-line">dtm_train = create_dtm(it_train, vectorizer)</td> </tr> <tr> <td id="file-featurization-r-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td> <td id="file-featurization-r-LC28" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td> <td id="file-featurization-r-LC29" class="blob-code blob-code-inner js-file-line">#create tf-idf for 
train data set</td> </tr> <tr> <td id="file-featurization-r-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td> <td id="file-featurization-r-LC30" class="blob-code blob-code-inner js-file-line">tfidf = TfIdf$new()</td> </tr> <tr> <td id="file-featurization-r-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td> <td id="file-featurization-r-LC31" class="blob-code blob-code-inner js-file-line">dtm_train_tfidf = fit_transform(dtm_train, tfidf)</td> </tr> <tr> <td id="file-featurization-r-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td> <td id="file-featurization-r-LC32" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td> <td id="file-featurization-r-LC33" class="blob-code blob-code-inner js-file-line">#create test tf-idf - use vocabulary that is built on train</td> </tr> <tr> <td id="file-featurization-r-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td> <td id="file-featurization-r-LC34" class="blob-code blob-code-inner js-file-line">it_test = itoken(df_test$text, preprocessor = prep_fun, tokenizer = tok_fun, ids = df_test$ID, progressbar = FALSE)</td> </tr> <tr> <td id="file-featurization-r-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td> <td id="file-featurization-r-LC35" class="blob-code blob-code-inner js-file-line">dtm_test_tfidf = create_dtm(it_test, vectorizer) %>% </td> </tr> <tr> <td id="file-featurization-r-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td> <td id="file-featurization-r-LC36" class="blob-code blob-code-inner js-file-line"> transform(tfidf)</td> </tr> <tr> <td id="file-featurization-r-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td> <td id="file-featurization-r-LC37" class="blob-code blob-code-inner js-file-line">#add label as additional 
column in matrices</td> </tr> <tr> <td id="file-featurization-r-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td> <td id="file-featurization-r-LC38" class="blob-code blob-code-inner js-file-line">dtm_train_tfidf<- Matrix(cbind(label=df_train$label,dtm_train_tfidf),sparse = TRUE)</td> </tr> <tr> <td id="file-featurization-r-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td> <td id="file-featurization-r-LC39" class="blob-code blob-code-inner js-file-line">dtm_test_tfidf<- Matrix(cbind(label=df_test$label,dtm_test_tfidf),sparse = TRUE)</td> </tr> <tr> <td id="file-featurization-r-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td> <td id="file-featurization-r-LC40" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td> <td id="file-featurization-r-LC41" class="blob-code blob-code-inner js-file-line"># write output - tf-idf matrices</td> </tr> <tr> <td id="file-featurization-r-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td> <td id="file-featurization-r-LC42" class="blob-code blob-code-inner js-file-line">writeMM(dtm_train_tfidf,args[3])</td> </tr> <tr> <td id="file-featurization-r-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td> <td id="file-featurization-r-LC43" class="blob-code blob-code-inner js-file-line">writeMM(dtm_test_tfidf,args[4])</td> </tr> <tr> <td id="file-featurization-r-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td> <td id="file-featurization-r-LC44" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-featurization-r-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td> <td id="file-featurization-r-LC45" class="blob-code blob-code-inner js-file-line">print("Two matrices were created - one for train and one for test data 
set")</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/Zoldin/9e79c047fd8ad7aa6596b0682aca83c6/raw/2787bc21fa8b2591ca09102f38f544eb5d6cf032/featurization.R" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/Zoldin/9e79c047fd8ad7aa6596b0682aca83c6#file-featurization-r" class="Link--inTextBlock"> featurization.R </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p><strong>train_model.R</strong>: using the created features, we fit a logistic regression with a LASSO penalty.</p> <p></p><div id="gist71114340" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-train_model-r" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="train_model.R content, created by Zoldin on 08:41PM on July 21, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> 
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="train_model.R"> <tbody><tr> <td id="file-train_model-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-train_model-r-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">#</span>!/usr/bin/Rscript</span></td> </tr> <tr> <td id="file-train_model-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-train_model-r-LC2" class="blob-code blob-code-inner js-file-line">library(<span class="pl-smi">Matrix</span>)</td> </tr> <tr> <td id="file-train_model-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-train_model-r-LC3" class="blob-code blob-code-inner js-file-line">library(<span class="pl-smi">glmnet</span>)</td> </tr> <tr> <td 
id="file-train_model-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-train_model-r-LC4" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-train_model-r-LC5" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> three arguments need to be provided - train file (.txt, matrix), seed and output name for RData file</span></td> </tr> <tr> <td id="file-train_model-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-train_model-r-LC6" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-train_model-r-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-v">args</span> <span class="pl-k">=</span> commandArgs(<span class="pl-v">trailingOnly</span><span class="pl-k">=</span><span class="pl-c1">TRUE</span>)</td> </tr> <tr> <td id="file-train_model-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-train_model-r-LC8" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-train_model-r-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-k">if</span> (<span class="pl-k">!</span>length(<span class="pl-smi">args</span>)<span class="pl-k">==</span><span class="pl-c1">3</span>) {</td> </tr> <tr> <td id="file-train_model-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-train_model-r-LC10" class="blob-code blob-code-inner js-file-line"> stop(<span class="pl-s"><span class="pl-pds">"</span>Three arguments must be supplied ( train file (.txt, matrix), seed and argument for 
RData model name).\n<span class="pl-pds">"</span></span>, <span class="pl-v">call.</span><span class="pl-k">=</span><span class="pl-c1">FALSE</span>)</td> </tr> <tr> <td id="file-train_model-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-train_model-r-LC11" class="blob-code blob-code-inner js-file-line">} </td> </tr> <tr> <td id="file-train_model-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-train_model-r-LC12" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td id="file-train_model-r-LC13" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">#</span>read train data set </span></td> </tr> <tr> <td id="file-train_model-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-train_model-r-LC14" class="blob-code blob-code-inner js-file-line"><span class="pl-v">trainMM</span> <span class="pl-k">=</span> readMM(<span class="pl-smi">args</span>[<span class="pl-c1">1</span>])</td> </tr> <tr> <td id="file-train_model-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-train_model-r-LC15" class="blob-code blob-code-inner js-file-line">set.seed(as.numeric(<span class="pl-smi">args</span>[<span class="pl-c1">2</span>]))</td> </tr> <tr> <td id="file-train_model-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-train_model-r-LC16" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-train_model-r-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-c"><span class="pl-c">#</span>use regular matrix, not sparse</span></td> </tr> <tr> <td id="file-train_model-r-L18" 
class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-train_model-r-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-smi">trainMM_reg</span> <span class="pl-k"><-</span> as.matrix(<span class="pl-smi">trainMM</span>)</td> </tr> <tr> <td id="file-train_model-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-train_model-r-LC19" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td> <td id="file-train_model-r-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-v">t1</span> <span class="pl-k">=</span> Sys.time()</td> </tr> <tr> <td id="file-train_model-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td> <td id="file-train_model-r-LC21" class="blob-code blob-code-inner js-file-line">print(<span class="pl-s"><span class="pl-pds">"</span>Started to train the model... 
<span class="pl-pds">"</span></span>)</td> </tr> <tr> <td id="file-train_model-r-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td> <td id="file-train_model-r-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-v">glmnet_classifier</span> <span class="pl-k">=</span> cv.glmnet(<span class="pl-v">x</span> <span class="pl-k">=</span> <span class="pl-smi">trainMM_reg</span>[,<span class="pl-c1">2</span><span class="pl-k">:</span><span class="pl-c1">500</span>], <span class="pl-v">y</span> <span class="pl-k">=</span> <span class="pl-smi">trainMM_reg</span>[,<span class="pl-c1">1</span>], </td> </tr> <tr> <td id="file-train_model-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-train_model-r-LC23" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">family</span> <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">'</span>binomial<span class="pl-pds">'</span></span>, </td> </tr> <tr> <td id="file-train_model-r-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td> <td id="file-train_model-r-LC24" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> L1 penalty</span></td> </tr> <tr> <td id="file-train_model-r-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td> <td id="file-train_model-r-LC25" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">alpha</span> <span class="pl-k">=</span> <span class="pl-c1">1</span>,</td> </tr> <tr> <td id="file-train_model-r-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td> <td id="file-train_model-r-LC26" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> interested in the area under ROC curve</span></td> </tr> <tr> <td id="file-train_model-r-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td> <td 
id="file-train_model-r-LC27" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">type.measure</span> <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">"</span>auc<span class="pl-pds">"</span></span>,</td> </tr> <tr> <td id="file-train_model-r-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td> <td id="file-train_model-r-LC28" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> 5-fold cross-validation</span></td> </tr> <tr> <td id="file-train_model-r-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td> <td id="file-train_model-r-LC29" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">nfolds</span> <span class="pl-k">=</span> <span class="pl-c1">5</span>,</td> </tr> <tr> <td id="file-train_model-r-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td> <td id="file-train_model-r-LC30" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> high value is less accurate, but has faster training</span></td> </tr> <tr> <td id="file-train_model-r-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td> <td id="file-train_model-r-LC31" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">thresh</span> <span class="pl-k">=</span> <span class="pl-c1">1e-3</span>,</td> </tr> <tr> <td id="file-train_model-r-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td> <td id="file-train_model-r-LC32" class="blob-code blob-code-inner js-file-line"> <span class="pl-c"><span class="pl-c">#</span> again lower number of iterations for faster training</span></td> </tr> <tr> <td id="file-train_model-r-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td> <td id="file-train_model-r-LC33" class="blob-code blob-code-inner js-file-line"> <span class="pl-v">maxit</span> <span class="pl-k">=</span> <span 
class="pl-c1">1e3</span>)</td> </tr> <tr> <td id="file-train_model-r-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td> <td id="file-train_model-r-LC34" class="blob-code blob-code-inner js-file-line">print(<span class="pl-s"><span class="pl-pds">"</span>Model generated...<span class="pl-pds">"</span></span>)</td> </tr> <tr> <td id="file-train_model-r-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td> <td id="file-train_model-r-LC35" class="blob-code blob-code-inner js-file-line">print(difftime(Sys.time(), <span class="pl-smi">t1</span>, <span class="pl-v">units</span> <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">'</span>sec<span class="pl-pds">'</span></span>))</td> </tr> <tr> <td id="file-train_model-r-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td> <td id="file-train_model-r-LC36" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td> <td id="file-train_model-r-LC37" class="blob-code blob-code-inner js-file-line"><span class="pl-v">preds</span> <span class="pl-k">=</span> predict(<span class="pl-smi">glmnet_classifier</span>, <span class="pl-smi">trainMM_reg</span>[,<span class="pl-c1">2</span><span class="pl-k">:</span><span class="pl-c1">500</span>], <span class="pl-v">type</span> <span class="pl-k">=</span> <span class="pl-s"><span class="pl-pds">'</span>response<span class="pl-pds">'</span></span>)[,<span class="pl-c1">1</span>]</td> </tr> <tr> <td id="file-train_model-r-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td> <td id="file-train_model-r-LC38" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td> <td id="file-train_model-r-LC39" class="blob-code blob-code-inner js-file-line">print(<span 
class="pl-s"><span class="pl-pds">"</span>AUC for the train... <span class="pl-pds">"</span></span>)</td> </tr> <tr> <td id="file-train_model-r-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td> <td id="file-train_model-r-LC40" class="blob-code blob-code-inner js-file-line"><span class="pl-e">glmnet</span><span class="pl-k">:::</span>auc(<span class="pl-smi">trainMM_reg</span>[,<span class="pl-c1">1</span>], <span class="pl-smi">preds</span>)</td> </tr> <tr> <td id="file-train_model-r-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td> <td id="file-train_model-r-LC41" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-train_model-r-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td> <td id="file-train_model-r-LC42" class="blob-code blob-code-inner js-file-line">save(<span class="pl-smi">glmnet_classifier</span>,<span class="pl-v">file</span><span class="pl-k">=</span><span class="pl-smi">args</span>[<span class="pl-c1">3</span>])</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/Zoldin/1617b39f2acbde3cd486616ac442e7cf/raw/5f12bfcec59aeddd8428f9d9c571a243c2302ae6/train_model.R" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/Zoldin/1617b39f2acbde3cd486616ac442e7cf#file-train_model-r" class="Link--inTextBlock"> train_model.R </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p><strong>evaluate.R</strong>: using the trained model, we predict the target on the test data set. 
The final output is the AUC, which serves as the evaluation metric.</p> <p></p><div id="gist71113477" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-evaluate-r" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-r" style="overflow: auto" tabindex="0" role="region" aria-label="evaluate.r content, created by Zoldin on 08:37PM on July 21, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content"> 
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="evaluate.r"> <tbody><tr> <td id="file-evaluate-r-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-evaluate-r-LC1" class="blob-code blob-code-inner js-file-line">#!/usr/bin/Rscript</td> </tr> <tr> <td id="file-evaluate-r-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-evaluate-r-LC2" class="blob-code blob-code-inner js-file-line">library(Matrix)</td> </tr> <tr> <td id="file-evaluate-r-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-evaluate-r-LC3" class="blob-code blob-code-inner js-file-line">library(glmnet)</td> </tr> <tr> <td id="file-evaluate-r-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-evaluate-r-LC4" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluate-r-L5" 
class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-evaluate-r-LC5" class="blob-code blob-code-inner js-file-line">args = commandArgs(trailingOnly=TRUE)</td> </tr> <tr> <td id="file-evaluate-r-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-evaluate-r-LC6" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluate-r-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-evaluate-r-LC7" class="blob-code blob-code-inner js-file-line">if (!length(args)==3) {</td> </tr> <tr> <td id="file-evaluate-r-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-evaluate-r-LC8" class="blob-code blob-code-inner js-file-line"> stop("Three arguments must be supplied ( file name where model is stored (RDataname), test file (.txt, matrix) and file name for AUC output).\n", call.=FALSE)</td> </tr> <tr> <td id="file-evaluate-r-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-evaluate-r-LC9" class="blob-code blob-code-inner js-file-line">} </td> </tr> <tr> <td id="file-evaluate-r-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-evaluate-r-LC10" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluate-r-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-evaluate-r-LC11" class="blob-code blob-code-inner js-file-line">#read test data set and model </td> </tr> <tr> <td id="file-evaluate-r-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-evaluate-r-LC12" class="blob-code blob-code-inner js-file-line">load(args[1])</td> </tr> <tr> <td id="file-evaluate-r-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td id="file-evaluate-r-LC13" class="blob-code blob-code-inner js-file-line">testMM = readMM(args[2])</td> 
</tr> <tr> <td id="file-evaluate-r-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-evaluate-r-LC14" class="blob-code blob-code-inner js-file-line">testMM_reg <- as.matrix(testMM)</td> </tr> <tr> <td id="file-evaluate-r-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-evaluate-r-LC15" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluate-r-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-evaluate-r-LC16" class="blob-code blob-code-inner js-file-line">#predict test data</td> </tr> <tr> <td id="file-evaluate-r-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-evaluate-r-LC17" class="blob-code blob-code-inner js-file-line">preds = predict(glmnet_classifier, testMM_reg[,2:500] , type = 'response')[, 1]</td> </tr> <tr> <td id="file-evaluate-r-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-evaluate-r-LC18" class="blob-code blob-code-inner js-file-line"> glmnet:::auc(testMM_reg[,1], preds)</td> </tr> <tr> <td id="file-evaluate-r-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-evaluate-r-LC19" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluate-r-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td> <td id="file-evaluate-r-LC20" class="blob-code blob-code-inner js-file-line">#write AUC into txt file</td> </tr> <tr> <td id="file-evaluate-r-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td> <td id="file-evaluate-r-LC21" class="blob-code blob-code-inner js-file-line">write.table(file=args[3],paste('AUC for the test file is : ',glmnet:::auc(testMM_reg[,1], preds)),row.names = FALSE,col.names = FALSE)</td> </tr> <tr> <td id="file-evaluate-r-L22" class="blob-num js-line-number js-blob-rnum" 
data-line-number="22"></td> <td id="file-evaluate-r-LC22" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-evaluate-r-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-evaluate-r-LC23" class="blob-code blob-code-inner js-file-line"> </td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/Zoldin/bfc2d4ee449098a9ff64b99c3326e61d/raw/8044bf4a8bf9301113705332f6a26936bd89445b/evaluate.r" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/Zoldin/bfc2d4ee449098a9ff64b99c3326e61d#file-evaluate-r" class="Link--inTextBlock"> evaluate.r </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <p>First, we download the code shown above into a new folder by cloning the repository:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">mkdir</span> R_DVC_GITHUB_CODE </span><span class="token line"><span class="token input">$ </span><span class="token command">cd</span> R_DVC_GITHUB_CODE </span> <span class="token line"><span class="token input">$ </span><span class="token git">git clone</span> https://github.com/Zoldin/R_AND_DVC</span></code></pre></div> <h2 id="dvc-installation-and-initialization" style="position:relative;">DVC installation and initialization<a href="#dvc-installation-and-initialization" aria-label="dvc installation and initialization permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 
1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>At first sight, it seemed that DVC would not be compatible with R, because DVC is written in Python and therefore requires Python packages and the pip package manager. Nevertheless, the tool is language agnostic: it can be used with any programming language, which makes it an excellent fit for R.</p> <p>DVC installation:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">pip3</span> <span class="token function">install</span> dvc </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc init</span></span></code></pre></div> <p>The code below executes 5 R scripts with <code>dvc run</code>. Each script is started with arguments: input and output file names and other parameters (seed, splitting ratio, etc.). 
It is important to use <code>dvc run</code>: with this command, each R script becomes a stage in the pipeline (a DAG).</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc import</span> https://s3-us-west-2.amazonaws.com/dvc-public/data/tutorial/nlp/25K/Posts.xml.zip <span class="token punctuation">\</span> data/ </span> <span class="token comment"># Extract XML from the archive.</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> <span class="token function">tar</span> zxf data/Posts.xml.tgz <span class="token parameter variable">-C</span> data/ </span> <span class="token comment"># Prepare data.</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/parsingxml.R <span class="token punctuation">\</span> data/Posts.xml <span class="token punctuation">\</span> data/Posts.csv </span> <span class="token comment"># Split training and testing dataset. 
Two output files.</span> <span class="token comment"># 0.33 is the test dataset splitting ratio.</span> <span class="token comment"># 20170426 is a seed for randomization.</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_test_spliting.R <span class="token punctuation">\</span> data/Posts.csv <span class="token number">0.33</span> <span class="token number">20170426</span> <span class="token punctuation">\</span> data/train_post.csv <span class="token punctuation">\</span> data/test_post.csv </span> <span class="token comment"># Extract features from text data.</span> <span class="token comment"># Two CSV inputs and two matrix text file outputs.</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/featurization.R <span class="token punctuation">\</span> data/train_post.csv <span class="token punctuation">\</span> data/test_post.csv <span class="token punctuation">\</span> data/matrix_train.txt <span class="token punctuation">\</span> data/matrix_test.txt </span> <span class="token comment"># Train the ML model on the training dataset.</span> <span class="token comment"># 20170426 is another seed value.</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/train_model.R <span class="token punctuation">\</span> data/matrix_train.txt <span class="token number">20170426</span> <span class="token punctuation">\</span> data/glmnet.Rdata </span> <span class="token comment"># Evaluate the model on the testing dataset.</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> Rscript code/evaluate.R <span class="token punctuation">\</span> data/glmnet.Rdata <span class="token punctuation">\</span> data/matrix_test.txt <span class="token punctuation">\</span> data/evaluation.txt </span> <span class="token comment"># The 
result.</span> <span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data/evaluation.txt</span></code></pre></div> <h2 id="dependency-flow-graph-on-r-example" style="position:relative;">Dependency flow graph on R example<a href="#dependency-flow-graph-on-r-example" aria-label="dependency flow graph on r example permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The dependency graph is shown in the picture below:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 256.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/e9ba609b030acd01d27fcd1ff99a3f7f/bb9ec/dependency-graph.png" alt="Dependency graph" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span><em>Dependency graph</em></p> <p>DVC remembers these dependencies and helps us reproduce results at any moment.</p> <p>For example, let's say that we change our training model to use a ridge penalty instead of a lasso penalty (by changing the alpha parameter to <code>0</code>). 
In that case we modify the <code>train_model.R</code> job, and if we want to repeat model development with this algorithm, we don’t need to repeat all the steps above, only the steps marked in red in the picture below:</p> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 256.5px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/da29b8bd00ccba3578fdfe91cd7f34bc/bb9ec/marked-steps.png" alt="marked steps" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Based on the DAG, DVC knows that the changed <code>train_model.R</code> file affects only the following files: <code>data/glmnet.Rdata</code> and <code>data/evaluation.txt</code>. To see our new results, we need to execute only the <code>train_model.R</code> and <code>evaluate.R</code> jobs. Conveniently, we don’t have to keep track of which steps need to be repeated; the <a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> command does that for us. Here is a code example:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token command">vi</span> code/train_model.R </span><span class="token line"><span class="token input">$ </span><span class="token git">git commit</span> <span class="token parameter variable">-am</span> <span class="token string">"Ridge penalty instead of lasso"</span> </span><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> data/evaluation.txt </span> Reproducing run command for data item data/glmnet.Rdata. Args: Rscript code/train_model.R data/matrix_train.txt 20170426 data/glmnet.Rdata Reproducing run command for data item data/evaluation.txt. 
Args: Rscript code/evaluate.R data/glmnet.Rdata data/matrix_test.txt data/evaluation.txt <span class="token line"><span class="token input">$ </span><span class="token command">cat</span> data/evaluation.txt </span>"AUC for the test file is : 0.947697381983095"</code></pre></div> <p><a href="https://dvc.org/doc/command-reference/repro"><code>dvc repro</code></a> always re-executes the steps that are affected by the latest developer changes. It knows what needs to be reproduced.</p> <p>DVC can also work in a <em>“multi-user environment”</em>. Pipelines (dependency graphs) are visible to other colleagues if we are working in a team and using Git as our version control tool. Data files can be shared if we set up cloud storage; with <em>dvc sync</em> we specify which data is shared with other users. Other users can then see the shared data and reproduce results with that data and their own code changes.</p> <h2 id="summary" style="position:relative;">Summary<a href="#summary" aria-label="summary permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>DVC improves and accelerates iterative development and helps keep track of ML processes and file dependencies in a simple form. In the R example, we saw how DVC remembers the dependency graph and, based on that graph, re-executes only the jobs that are affected by the latest changes. It can also work in a multi-user environment where dependency graphs, code, and data can be shared among multiple users. 
Because it is language agnostic, DVC allows us to work with multiple programming languages within a single data science project.</p>https://dvc.org/blog/how-data-scientists-can-improve-their-productivityhttps://dvc.org/blog/how-data-scientists-can-improve-their-productivityMon, 15 May 2017 00:00:00 GMT<p>Data science and machine learning are iterative processes. It is never possible to successfully complete a data science project in a single pass. A data scientist constantly tries new ideas and changes steps of her pipeline:</p> <ol> <li> <p>extract new features and accidentally find noise in the data;</p> </li> <li> <p>clean up the noise, find one more promising feature;</p> </li> <li> <p>extract the new feature;</p> </li> <li> <p>rebuild and validate the model, realize that the learning algorithm parameters are not perfect for the new feature set;</p> </li> <li> <p>change machine learning algorithm parameters and retrain the model;</p> </li> <li> <p>find the ineffective feature subset and remove it from the feature set;</p> </li> <li> <p>try a few more new features;</p> </li> <li> <p>try another ML algorithm. And then a data format change is required.</p> </li> </ol> <p>This is only a small episode in a data scientist’s daily life and it is what makes our job different from a regular engineering job.</p> <p>Business context, ML algorithm knowledge and intuition all help you to find a good model faster. But you never know for sure which ideas will bring you the best value.</p> <p>This is why the iteration time is a critical parameter in the data science process. 
The quicker you iterate, the more ideas you can check and the better the model you can build.</p> <blockquote> <p>“A well-engineered pipeline gets data scientists iterating much faster, which can be a big competitive edge” From <a href="http://blog.untrod.com/2012/10/engineering-practices-in-data-science.html" target="_blank" rel="nofollow noopener noreferrer">Engineering Practices in Data Science</a> by Chris Clark.</p> </blockquote> <h2 id="a-data-science-iteration-tool" style="position:relative;">A data science iteration tool<a href="#a-data-science-iteration-tool" aria-label="a data science iteration tool permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>To speed up the iterations in data science projects, we have created an open source tool, <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">data version control</a>, or <a href="http://dvc.org" target="_blank" rel="nofollow noopener noreferrer">DVC.org</a>.</p> <p>DVC takes care of the dependencies between the commands that you run, the generated data files, and the code files, and allows you to easily reproduce any step of your research when files change.</p> <p>You can think of DVC as a Makefile for a data science project, even though you do not create a file explicitly. 
DVC tracks dependencies in your data science projects when you run data processing or modeling code through a special command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> python code/xml_to_tsv.py <span class="token punctuation">\</span> data/Posts.xml data/Posts.tsv</span></code></pre></div> <p><code>dvc run</code> works as a proxy for your commands. This allows DVC to track input and output files, construct the dependency graph (<a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph" target="_blank" rel="nofollow noopener noreferrer">DAG</a>), and store the command and its parameters for future reproduction.</p> <p>The previous command is automatically chained to the next one because the file <code>data/Posts.tsv</code> is an output of the previous command and an input of the next one:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Split training and testing dataset. 
Two output files.</span> <span class="token comment"># 0.33 is the test dataset splitting ratio.</span> <span class="token comment"># 20170426 is a seed for randomization.</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc run</span> python code/split_train_test.py <span class="token punctuation">\</span> data/Posts.tsv <span class="token number">0.33</span> <span class="token number">20170426</span> <span class="token punctuation">\</span> data/Posts-train.tsv data/Posts-test.tsv</span></code></pre></div> <p>DVC derives the dependencies automatically by looking at the list of parameters (even if your code ignores them) and noting the file changes before and after running the command.</p> <p>If you change one of your dependencies (data or code), then all the affected steps of the pipeline will be reproduced:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Change the data preparation code.</span> <span class="token line"><span class="token input">$ </span><span class="token command">vi</span> code/xml_to_tsv.py </span> <span class="token comment"># Reproduce.</span> <span class="token line"><span class="token input">$ </span><span class="token dvc">dvc repro</span> data/Posts-train.tsv </span>Reproducing run command for data item data/Posts.tsv. Reproducing run command for data item data/Posts-train.tsv.</code></pre></div> <p>This is a powerful way of quickly iterating through your pipeline.</p> <p>The pipeline might have many steps, with various forms of acyclic dependencies between them. 
Below is an example of a canonical machine learning pipeline (more details in <a href="https://dvc.org/doc/tutorials" target="_blank" rel="nofollow noopener noreferrer">the DVC tutorials</a>):</p> <p></p><div id="gist47206784" class="gist"> <div class="gist-file" translate="no" data-color-mode="light" data-light-theme="light"> <div class="gist-data"> <div class="js-gist-file-update-container js-task-list-container"> <div id="file-dvc_pipeline-sh" class="file my-2"> <div itemprop="text" class="Box-body p-0 blob-wrapper data type-shell" style="overflow: auto" tabindex="0" role="region" aria-label="dvc_pipeline.sh content, created by dmpetrov on 07:11AM on April 30, 2017."> <div class="js-check-hidden-unicode js-blob-code-container blob-code-content">
<table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="dvc_pipeline.sh"> <tbody><tr> <td id="file-dvc_pipeline-sh-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td> <td id="file-dvc_pipeline-sh-LC1" class="blob-code blob-code-inner js-file-line"># Install DVC</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td> <td id="file-dvc_pipeline-sh-LC2" class="blob-code blob-code-inner js-file-line">$ pip install dvc</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td> <td id="file-dvc_pipeline-sh-LC3" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td> <td id="file-dvc_pipeline-sh-LC4" class="blob-code blob-code-inner js-file-line"># Initialize DVC 
repository</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td> <td id="file-dvc_pipeline-sh-LC5" class="blob-code blob-code-inner js-file-line">$ dvc init</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td> <td id="file-dvc_pipeline-sh-LC6" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td> <td id="file-dvc_pipeline-sh-LC7" class="blob-code blob-code-inner js-file-line"># Download a file and put to data/ directory.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td> <td id="file-dvc_pipeline-sh-LC8" class="blob-code blob-code-inner js-file-line">$ dvc import https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz data/</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td> <td id="file-dvc_pipeline-sh-LC9" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td> <td id="file-dvc_pipeline-sh-LC10" class="blob-code blob-code-inner js-file-line"># Extract XML from the archive.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td> <td id="file-dvc_pipeline-sh-LC11" class="blob-code blob-code-inner js-file-line">$ dvc run tar zxf data/Posts.xml.tgz -C data/</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td> <td id="file-dvc_pipeline-sh-LC12" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td> <td 
id="file-dvc_pipeline-sh-LC13" class="blob-code blob-code-inner js-file-line"># Prepare data.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td> <td id="file-dvc_pipeline-sh-LC14" class="blob-code blob-code-inner js-file-line">$ dvc run python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv python</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td> <td id="file-dvc_pipeline-sh-LC15" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td> <td id="file-dvc_pipeline-sh-LC16" class="blob-code blob-code-inner js-file-line"># Split training and testing dataset. Two output files.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td> <td id="file-dvc_pipeline-sh-LC17" class="blob-code blob-code-inner js-file-line"># 0.33 is the test dataset splitting ratio. 20170426 is a seed for randomization.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td> <td id="file-dvc_pipeline-sh-LC18" class="blob-code blob-code-inner js-file-line">$ dvc run python code/split_train_test.py data/Posts.tsv 0.33 20170426 data/Posts-train.tsv data/Posts-test.tsv</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td> <td id="file-dvc_pipeline-sh-LC19" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td> <td id="file-dvc_pipeline-sh-LC20" class="blob-code blob-code-inner js-file-line"># Extract features from text data. 
Two TSV inputs and two pickle matrixes outputs.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td> <td id="file-dvc_pipeline-sh-LC21" class="blob-code blob-code-inner js-file-line">$ dvc run python code/featurization.py data/Posts-train.tsv data/Posts-test.tsv data/matrix-train.p data/matrix-test.p</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td> <td id="file-dvc_pipeline-sh-LC22" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td> <td id="file-dvc_pipeline-sh-LC23" class="blob-code blob-code-inner js-file-line"># Train ML model out of the training dataset. 20170426 is another seed value.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td> <td id="file-dvc_pipeline-sh-LC24" class="blob-code blob-code-inner js-file-line">$ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td> <td id="file-dvc_pipeline-sh-LC25" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td> <td id="file-dvc_pipeline-sh-LC26" class="blob-code blob-code-inner js-file-line"># Evaluate the model by the testing dataset.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td> <td id="file-dvc_pipeline-sh-LC27" class="blob-code blob-code-inner js-file-line">$ dvc run python code/evaluate.py data/model.p data/matrix-test.p data/evaluation.txt</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L28" class="blob-num js-line-number js-blob-rnum" 
data-line-number="28"></td> <td id="file-dvc_pipeline-sh-LC28" class="blob-code blob-code-inner js-file-line"> </td> </tr> <tr> <td id="file-dvc_pipeline-sh-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td> <td id="file-dvc_pipeline-sh-LC29" class="blob-code blob-code-inner js-file-line"># The result.</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td> <td id="file-dvc_pipeline-sh-LC30" class="blob-code blob-code-inner js-file-line">$ cat data/evaluation.txt</td> </tr> <tr> <td id="file-dvc_pipeline-sh-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td> <td id="file-dvc_pipeline-sh-LC31" class="blob-code blob-code-inner js-file-line">AUC: 0.596182</td> </tr> </tbody></table> </div> </div> </div> </div> </div> <div class="gist-meta"> <a href="https://gist.github.com/dmpetrov/7704a5156bdc32c7379580a61e2fe3b6/raw/166cf09a233861902f1765e9179c1dce556fdcf5/dvc_pipeline.sh" style="float:right" class="Link--inTextBlock">view raw</a> <a href="https://gist.github.com/dmpetrov/7704a5156bdc32c7379580a61e2fe3b6#file-dvc_pipeline-sh" class="Link--inTextBlock"> dvc_pipeline.sh </a> hosted with ❤ by <a class="Link--inTextBlock" href="https://github.com">GitHub</a> </div> </div> </div><p></p> <h2 id="why-are-regular-pipeline-tools-not-enough" style="position:relative;">Why are regular pipeline tools not enough?<a href="#why-are-regular-pipeline-tools-not-enough" aria-label="why are regular pipeline tools not enough permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z"></path></svg> </a></h2> <blockquote> <p>“Workflows are expected to be mostly static or slowly changing.” (See <a href="https://airflow.incubator.apache.org/" target="_blank" rel="nofollow noopener noreferrer">Airflow</a>.)</p> </blockquote> <p>Regular pipeline tools like <a href="http://airflow.incubator.apache.org" target="_blank" rel="nofollow noopener noreferrer">Airflow</a> and <a href="https://github.com/spotify/luigi" target="_blank" rel="nofollow noopener noreferrer">Luigi</a> are good for representing static and fault-tolerant workflows. A huge portion of their functionality is devoted to monitoring, optimization and fault tolerance. These are very important and business-critical problems. However, these problems are irrelevant to data scientists’ daily lives.</p> <p>Data scientists need a lightweight, dynamic workflow management system. In contrast to traditional Airflow-like systems, DVC reflects the process of researching and looking for a great model (and pipeline), not optimizing and monitoring an existing one. This is why DVC is a good fit for iterative machine learning processes. Once a good model has been discovered with DVC, the result can be incorporated into a data engineering pipeline (Luigi or Airflow).</p> <h2 id="pipelines-and-data-sharing" style="position:relative;">Pipelines and data sharing<a href="#pipelines-and-data-sharing" aria-label="pipelines and data sharing permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>In addition to pipeline description, data reproduction, and its dynamic nature, DVC has one more important feature. 
It was designed in accordance with the best software engineering practices. DVC is based on Git. It keeps the code and stores the DAG in the Git repository, which allows you to share your research results. But it moves the actual file content outside the Git repository (into the <code>.cache</code> directory, which DVC adds to <code>.gitignore</code>), since Git is not designed to accommodate large data files.</p> <p>The data files can be shared between data scientists through cloud storage using a simple command:</p> <div class="gatsby-highlight" data-language="dvc"><pre class="language-dvc"><code class="language-dvc"><span class="token comment"># Data scientist 1 syncs data to the cloud.</span> <span class="token line"><span class="token input">$ </span><span class="token command">dvc</span> <span class="token function">sync</span> data/</span></code></pre></div> <p><span class="gatsby-resp-image-wrapper" style="position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 307px; "><img class="gatsby-resp-image-image" src="https://dvc.org/static/6890171452971f3e3cd847014a526e03/7fc5b/git-server-or-github.jpg" alt="git server or github" title="" loading="auto" decoding="async" style="max-width: 100%; margin: auto;"></span></p> <p>Currently, DVC supports AWS S3 and GCP storage.</p> <h2 id="conclusion" style="position:relative;">Conclusion<a href="#conclusion" aria-label="conclusion permalink" class="anchor after"><svg aria-hidden="true" height="16" width="16"><path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z"></path></svg> </a></h2> <p>The productivity of data scientists can be improved by speeding up the iteration process, and DVC 
takes care of this.</p> <p>We are very interested in your opinion and feedback. Please post your comments here or contact us on Twitter: <a href="https://twitter.com/FullStackML" target="_blank" rel="nofollow noopener noreferrer">FullStackML</a>.</p> <p>If you found this tool useful, <strong>please “star” the <a href="https://github.com/iterative/dvc" target="_blank" rel="nofollow noopener noreferrer">DVC GitHub repository</a></strong>.</p>
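<p>To make the selective reproduction described in this post more concrete, here is a minimal Python sketch. It is not DVC's implementation: the <code>STAGES</code> table, the stage names, and the <code>affected_stages</code> function are hypothetical, chosen only to mirror the pipeline shown earlier. The sketch assumes each stage simply lists its input and output files, and it shows how a dependency graph lets a tool decide which steps are affected when one file changes.</p>

```python
from collections import deque

# Hypothetical stage table mirroring the pipeline in this post:
# stage name -> (input files, output files). This is NOT DVC's data model.
STAGES = {
    "extract": (["data/Posts.xml.tgz"], ["data/Posts.xml"]),
    "prepare": (["code/xml_to_tsv.py", "data/Posts.xml"], ["data/Posts.tsv"]),
    "split": (
        ["code/split_train_test.py", "data/Posts.tsv"],
        ["data/Posts-train.tsv", "data/Posts-test.tsv"],
    ),
    "featurize": (
        ["code/featurization.py", "data/Posts-train.tsv", "data/Posts-test.tsv"],
        ["data/matrix-train.p", "data/matrix-test.p"],
    ),
    "train": (["code/train_model.py", "data/matrix-train.p"], ["data/model.p"]),
    "evaluate": (
        ["code/evaluate.py", "data/model.p", "data/matrix-test.p"],
        ["data/evaluation.txt"],
    ),
}


def affected_stages(changed_file, stages=STAGES):
    """Return the stages affected by a change to one file, in table order."""
    dirty_files = {changed_file}
    dirty_stages = set()
    queue = deque([changed_file])
    while queue:
        path = queue.popleft()
        for name, (inputs, outputs) in stages.items():
            if path in inputs and name not in dirty_stages:
                # The stage reads a changed file, so its outputs change too.
                dirty_stages.add(name)
                for out in outputs:
                    if out not in dirty_files:
                        dirty_files.add(out)
                        queue.append(out)
    # The table above is written in a valid execution order, so reuse it.
    return [name for name in stages if name in dirty_stages]


if __name__ == "__main__":
    print(affected_stages("code/train_model.py"))
    print(affected_stages("code/xml_to_tsv.py"))
```

<p>Under these assumptions, changing the training script marks only the training and evaluation stages, while changing the data preparation script marks every stage after extraction. Note that the sketch follows all downstream stages, whereas <code>dvc repro</code> in the examples above is given a specific target file.</p>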